2011 Workshops of International Conference on Advanced Information Networking and Applications



                 K-means clustering in the cloud - a Mahout test

      Rui Máximo Esteves*                Chunming Rong*                Rui Pais*
      rui.esteves@uis.no                 chunming.rong@uis.no          rma.pais@stud.uis.no

      * Department of Electrical and Computer Engineering, University of Stavanger, Norway



 Abstract— The K-means is a well known clustering algorithm
 that has been successfully applied to a wide variety of problems.
 However, its application has usually been restricted to small
 datasets. Mahout is a cloud computing approach to K-means that
 runs on a Hadoop system. Both Mahout and Hadoop are free
 and open source. Due to their inexpensive and scalable
 characteristics, these platforms can be a promising technology for
 solving data intensive problems that were not trivial in the past.
 In this work we studied the performance of Mahout using a large
 dataset. The tests were run on Amazon EC2 instances and
 allowed us to compare the gain in runtime when running on a multi
 node cluster. This paper presents some results of ongoing
 research.

     Keywords- K-means, cloud computing, mahout, map reduce

                        I.    INTRODUCTION

     The K-means is one of the most frequently used clustering
 algorithms. It is simple and straightforward and has been
 successfully applied during the last few years. Under the
 assumption that datasets tend to be small, research on
 clustering algorithms has traditionally focused on improving
 the quality of clustering. However, many datasets now are
 large and cannot fit into main memory [1].
     Mahout/Hadoop can be a promising and inexpensive
 solution for problems with large datasets. However,
 there is a lack of studies regarding its performance. With this
 work we wanted to study the gain in runtime when using this
 solution and also to test whether there was any loss in cluster
 quality. To meet those aims we used Amazon EC2 instances
 and set up a small cluster. We chose to study K-means since
 it is one of the most popular and simplest algorithms.
     The rest of this paper is organized in the following way:
 section II presents some theoretical foundations. The
 experiments are described in section III, and in section IV the
 results, interpretation and discussion are presented. Future
 work is in section V and the final conclusions are in section
 VI.

                        II.   THEORY

    Clustering

     Clustering refers to the process of grouping samples into
 different classes based on their similarities. Samples within a
 class have high similarity in comparison to one another but
 are very dissimilar to samples in other classes. These groups
 or classes are called clusters. Clustering algorithms can
 divide the samples into clusters automatically, without any
 preconceptions about what kind of groupings should be
 found. Within the context of machine learning, clustering is
 considered a form of unsupervised learning, since there
 is no target variable to guide the learning process. Unlike
 classification, clustering and unsupervised learning do not
 rely on predefined classes and class-labeled training
 examples. For this reason, clustering is a form of learning by
 observation rather than learning by examples [2]. The
 number, size, and shape of the clusters are in general not known
 in advance, and each of these parameters must be
 determined by either the user or the clustering algorithm [3].
     Clustering has been used in many areas, including data
 mining, statistics, biology, and machine learning. Clustering
 can also be used for outlier detection, where outliers (values
 that are “far away” from any cluster) may be more
 interesting than common cases [2]. One useful application
 example is fault detection [4, 5]. Other examples of
 applications can be found in [2, 6]. The concept of clustering
 is not particularly new, but it is still an important topic of
 research. A useful survey of recent advances can be
 found in [7].

    Cluster algorithm types

     The two major types of cluster algorithms are
 hierarchical and partitional.
     The first type produces a hierarchical decomposition of
 the data set into partitions. It merges or splits the partitions
 into clusters using a similarity criterion.
     In the second type, an algorithm creates an initial division
 of the data set, according to a given number of partitions. It
 then uses an iterative relocation technique that attempts to
 improve the partitioning by moving objects from one group
 to another. The general criterion for partitioning is a
 combination of high similarity of the samples inside
 clusters with high dissimilarity between distinct clusters.
 This means that samples in the same cluster are “close” or

978-0-7695-4338-3/11 $26.00 © 2011 IEEE                              514
DOI 10.1109/WAINA.2011.136
related to each other, whereas samples of different clusters
are “far apart” or very different [2, 6].
    Each clustering type is better suited to a particular
type of problem. Abbas [8] concluded that partitional
algorithms are suited for large data sets while hierarchical ones
are for small data sets. Singla, Yadav and Singh [9] presented
some arguments in favor of partitional algorithms. According
to them, there is no need to determine the full dendrogram
with a partitional algorithm. They also found that the time
complexity of partitional algorithms is small compared with that of
hierarchical ones. Zhao and Karypis [10] had similar results. On
the other hand, hierarchical algorithms have the advantages of
an easily interpreted graphical representation and insensitivity
to initialization and local minima [11].

       K-means

    K-means clustering is a widely used partitional algorithm.
It partitions the samples into clusters by minimizing a
distance measure between the samples and the centroids of the
clusters. Euclidean distance is a similarity criterion
extensively used for samples in Euclidean space. Manhattan
distance is a simpler measure, also for Euclidean space.
Cosine distance and Jaccard distance are often employed for
document clustering [12].
    The algorithm for the standard K-means is given as
follows [13]:
    1. Choose a number of clusters k.
    2. Initialize centroids1 c1, ..., ck.
    3. For each data point, compute the centroid it is closest
         to (using some distance measure) and assign the data
         point to this cluster.
    4. Re-compute centroids (mean of data points in
         cluster).
    5. Stop when there are no new re-assignments.

    The K-means clustering is simple, but it has high time
complexity when the data sets are large. In these
circumstances the memory of a single machine can be a
restriction. As a consequence it has not been used in the past
with large data sets [13].

       Hadoop/Mahout solution

    Hadoop2 is a software framework which allows the
effortless development of cloud computing systems. It
supports data intensive distributed applications on large
clusters built of commodity hardware. This framework is
designed to be scalable, which allows the user to add more
nodes as the application requires. Hadoop uses a distributed
computing paradigm named MapReduce.
    The MapReduce programming model consists of two
user defined functions, map and reduce, specified on a job.
The job usually splits the input dataset into independent
blocks which are processed by the map tasks in a parallel
way. Hadoop sorts the outputs of the maps, which are then
the input to be processed by the reduce tasks [14]. Hadoop
uses a distributed file system, HDFS, which creates multiple
replicas of data blocks and distributes them on compute
nodes throughout a cluster to enable reliable, extremely rapid
computations [15].
    Mahout3 is a machine learning library that runs over a
Hadoop system. It has a collection of algorithms to solve
clustering, classification and prediction problems. It uses the
MapReduce paradigm, which in combination with Hadoop
can be used as an inexpensive solution to solve machine
learning problems.
    Mahout is a project still in development which has been
used mainly for recommendation engines, document
classifiers and other typical web problems. At the
moment, its usage for clustering is not sufficiently explored.
    Hadoop and Mahout are free and open source projects.
Due to their inexpensive and scalable characteristics, these
platforms can be a promising technology to solve data
intensive problems which were not trivial in the past.
    However, it is important to study the tradeoff between
the overhead of using MapReduce and the gain in
performance obtained by distributing the computation. It is also
essential to check if the clusters maintain their quality.
    Mahout contains various implementations of clustering,
like K-means, fuzzy K-means, mean shift and Dirichlet,
among others. To input the data for Mahout clustering it is
necessary to follow some procedures first. If the data is not
numerical it has to be preprocessed first. It is then required to
create vectors. If the data set is sparse, the user can
create sparse vectors that are much more compact. The
vectors are finally converted to a specific Hadoop file format,
the SequenceFile.
    The K-means clustering algorithm takes the following
input parameters:
    - A SequenceFile containing the input vectors.
    - A SequenceFile containing the initial cluster centers. If
not present, Mahout will attribute them randomly.
    - A similarity measure to be used.
    - The convergence threshold that will be the stopping
condition of the K-means. If in a particular iteration no
cluster center changes beyond that threshold,
then no further iterations are done.
    - The maximum number of iterations to be processed if it
does not converge first.
    - The number of reducers to be used. This value
determines the parallelism of the execution.
    - The vector implementation used for the input files.
    As output the user gets the centroid coordinates and the
samples attributed to each cluster. The output files are in
SequenceFile format. Mahout provides the necessary tools
for file conversion and for creating the vectors [16].
    There are three stages in a K-means job (figure 1):
    - Initial stage: the segmentation of the dataset into
         HDFS blocks; their replication and transfer to other
         machines; according to the number of blocks and the
         cluster configuration, the necessary tasks are assigned
         and distributed.
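The five steps of the standard K-means listed above can be sketched in plain Python. This is a minimal single-machine illustration of the algorithm, not the Mahout implementation; all function and parameter names here are our own. Like Mahout, it accepts optional initial cluster centers and falls back to random samples otherwise.

```python
import random

def euclidean(a, b):
    # Distance measure used to match each sample to its nearest centroid.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, init=None, max_iters=150):
    # Steps 1-2: choose k and initialize the centroids; if no initial
    # centers are supplied, pick k random samples.
    centroids = list(init) if init is not None else random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iters):
        # Step 3: assign each point to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop when there are no new re-assignments.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: re-compute each centroid as the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignment
```

Fixing `init` mirrors the experimental setup of this paper, where the same initial centroids were reused so that runs are comparable.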
1 Centroid refers to the center of the cluster.
2 http://hadoop.apache.org/
3 http://mahout.apache.org/
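To illustrate how the map and reduce roles of the MapReduce model described above divide the work of a single K-means iteration, here is a hedged, single-process sketch. It only simulates the dataflow (mappers emit (cluster, point) pairs per block, a shuffle groups them, reducers average); the names are ours and this is not the actual Hadoop or Mahout API.

```python
from collections import defaultdict

def map_task(block, centroids):
    # Map stage: match each sample in the block with the nearest
    # centroid and emit (cluster index, sample) pairs.
    for point in block:
        nearest = min(
            range(len(centroids)),
            key=lambda c: sum((x - y) ** 2 for x, y in zip(point, centroids[c])),
        )
        yield nearest, point

def reduce_task(cluster, points):
    # Reduce stage: the new centroid is the average of the coordinates
    # of all points assigned to this cluster (cluster id kept to mirror
    # the (key, values) reducer signature).
    return tuple(sum(dim) / len(points) for dim in zip(*points))

def kmeans_iteration(blocks, centroids):
    # One full iteration: run a map task per data block, group the
    # emitted pairs by cluster (the shuffle), then run the reduce tasks.
    # Empty clusters keep their previous centroid.
    grouped = defaultdict(list)
    for block in blocks:
        for cluster, point in map_task(block, centroids):
            grouped[cluster].append(point)
    return [
        reduce_task(c, grouped[c]) if grouped[c] else centroids[c]
        for c in range(len(centroids))
    ]
```

The driver would call `kmeans_iteration` in a loop, feeding each result back as the next set of centroids, until they move less than the convergence threshold.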



                      Figure 1 – The three stages of a K-means job.

    - Map stage: calculates the distances between samples
and centroids, matches samples with the nearest centroid
and assigns them to that specific cluster. Each map task
processes one data block.
    - Reduce stage: recalculates the centroid point using the
average of the coordinates of all the points in that cluster.
The associated points are averaged out to produce the new
location of the centroid. The centroid configuration is
fed back into the mappers. The loop ends when the
centroids converge.
    After finishing the job, Mahout provides the K-means
time, the centroids and the clustered samples.

   Quality of clustering

    An important problem in clustering is how to decide what
the best set of clusters is for a given data set, in terms of both
the number of clusters and the membership of those clusters
[17].
    One simple inter-cluster quality measure was used in this
work: the sum of the squared distances between centroids
(Q1) [18]:

        Q1 = sum for i = 1..k-1, j = i+1..k of d(ci, cj)^2        (eq. 1)

    where: ci is the centroid of cluster i;
    d is the distance metric;
    and k is the number of clusters.

    In our study, we do not aim to reach the optimal cluster
configuration. Instead, we want to check if the quality of the
Mahout K-means is maintained between the tests. Therefore, we
chose this measure since it is not computationally
expensive and is easy to implement.

   Related work

    At this date we could not find any intensive tests of
Mahout in the literature. However, there are two interesting
works on scaling tools for data mining using MapReduce
[19, 20].

                      III.   EXPERIMENTS

   The dataset

    To evaluate the performance of multi node versus single
node, we used a dataset from the 1999 KDD Cup4. This dataset
was gathered by Lincoln Labs: nine weeks of raw TCP dump
data were collected from a typical US Air Force LAN. During
the use of the LAN, several attacks were performed on it.
The raw packet data were then aggregated into connection
data. Per record, extra features were derived, based on
domain knowledge about network attacks. There are 38
different attack types, belonging to 4 main categories. The dataset
has 4,940,000 records, with 41 attributes and 1 label (converted
to numerical). A 1.1 GB dataset was used. This file was
randomly segmented into smaller files. For the data
preprocessing we developed Java software.

   The machines

    We ran our tests using Amazon EC2 m1.large instances
with the following characteristics: 7.5 GB memory; 4 EC2
Compute Units (2 virtual cores with 2 EC2 Compute Units
each); 850 GB of storage; 64-bit platform; high I/O
performance.
    All the instances were launched using the same Amazon
Machine Image (AMI). We adapted an existing Cloudera
AMI5 to our specific needs and stored it on Amazon S3. Our
AMI uses Mahout 0.4 and Hadoop 0.20.2+237, and we added
scripts for automatic cluster configuration. This way the user
just needs to launch EC2 instances with this AMI and run a
script on the master node, and the cluster becomes ready for
use. The HDFS configuration used was the default one,
where the block size is 64 MB and the replication factor is 3.

   Jobs

    To study how the performance changes with the
scalability, K-means jobs were set up to guarantee that the
algorithm would converge. K-means can produce different
results depending on the position of the initial centroids. To
avoid this issue, the same initial centroids were used for all
experiments. Each test was repeated 3 times to ensure the
results were not affected by Amazon fluctuations.
    The Mahout K-means algorithm was executed using the
following parameters: Euclidean distance as the similarity
measure on experiments A and B; Squared Euclidean on C.
All experiments were set with 150 maximum iterations
and 11 clusters.
    To guarantee that the quality of the Mahout K-means was
maintained, equation 1 was applied.

4 http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html
5 cloudera-hadoop-ubuntu-20090623_x86_64
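Equation 1 is cheap to compute, which is why it was chosen as the quality check. A minimal sketch in Python (our own helper, not part of Mahout):

```python
from itertools import combinations
from math import dist  # Euclidean distance, available since Python 3.8

def q1(centroids):
    # Sum of squared distances between every pair of cluster centroids
    # (eq. 1); a stable value across runs indicates the cluster
    # geometry was maintained.
    return sum(dist(a, b) ** 2 for a, b in combinations(centroids, 2))
```

Comparing Q1 between experiments gives a quick indication that the resulting centroids, and hence the clustering quality, did not drift between runs.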


              IV.     RESULTS AND INTERPRETATION

  A – How does Mahout scale with the file size?

    It can be observed in table 1 that the 5 node cluster
performs worse than the single node for small datasets.
With a 66 MB file (6%), the gain from distributing the work
does not compensate for the overhead of the initial step. Given
that the HDFS block size is 64 MB, only two tasks are
executed, and since each machine is a dual core, no gain is
expected for a file smaller than 128 MB. The gain is already
present for a file slightly bigger than 128 MB (the 12%
scenario). The overhead is largely compensated as soon as
the dataset grows: the gain in performance reaches 351%
for a 1.1 GB file.

    Data File   K-means    K-means    Iterations     Gain
    (%)         MN (sec)   SN (sec)   to converge    (%)
      6           43.9       41.3         72          -6%
     12           45.3       52.8         78          16%
     25           46.4       88.5         72          91%
     50           48.6      149.1         71         207%
    100           70.3      316.7         56         351%
Table 1. Times per iteration and gain with the file size. MN - 5 Multi node.
                             SN - Single node.

    Figure 2 shows that when the data file is increased, the 5
node time grows very slowly. This seems to indicate that
the initial step has a considerable footprint. For this specific
dataset and set of fixed initial centroids, the K-means
needed fewer iterations to converge as the file became
bigger. Since k is the same, this is expected behavior:
given that there are more samples per cluster, the centroid
averages are more stable and the algorithm converges
in fewer iterations. However, each iteration is more
computationally demanding. On the other side, more iterations
mean more communication between the mappers and the
reducers. On a cluster this implies more network usage,
which can become a bottleneck.

   Figure 2. Total convergence times for: 5 Multi node cluster and Single
                          node. File size: 1.1 GB.

  B – How does Mahout scale with the number of nodes?

    Mahout decreases the clustering time as the number of
nodes increases. For 20 nodes the time is reduced to less than one
fifth. In figure 3, it can be noticed that the clustering time
diminishes with the number of nodes in a non linear way.

   Figure 3. Runtimes per iteration with the number of nodes. File size: 1.1
                                    GB.

    One can verify in table 2 that the gain is almost 100%
between 1 and 2 nodes. However, it increases only to 444%
between 1 and 20 nodes. This happens due to the size of the
dataset, which seems to be too small for 20 nodes. The file
is divided into 18 blocks, and 2 idle machines were noticed in
the 20 node cluster. For this dataset a possible improvement
could be forcing a higher number of tasks.
    It is interesting to notice that when the number of nodes
increases, the CPU usage diminishes and the network usage
grows. This can be clearly seen by comparing figures 4, 5 and
6, 7. Those graphs were obtained from Amazon
CloudWatch and correspond to the monitoring of a slave
node. They compare the situation between a 5 and a 10
node cluster executing K-means with the full dataset. The
CPU idle time in the 10 node scenario is noticeable. It
indicates that the instance was waiting for the results from
the other nodes. The network was continually used at around


84 Mbps in this scenario, while it was interrupted on the 5
nodes. Extremely high peaks were present
when the cluster was not executing a job; as such, they should
be related to Amazon and not to the experiment itself. Both
graphs show how intensive the network usage was. They
also point towards a saturation of the network.

                 Nodes   K-means    Gain
                          (min)     (%)
                   1       296
                   2       149       98%
                   5        80      271%
                  10        66      351%
                  20        54      444%
        Table 2. Converging times varying the number of nodes.
            File size: 1.1 GB. Gain compared with 1 node.

Figure 4. CPU utilization per slave node. The x axis represents the time when
           the experiment occurred. 5 nodes. File size: 1.1 GB.

                  Figure 5. CPU utilization per slave node. 10 nodes.
                                  File size: 1.1 GB.

                        Figure 6. Network usage per node. 5 nodes.
                                     File size: 1.1 GB.

                        Figure 7. Network usage per node. 10 nodes.
                                     File size: 1.1 GB.

    C – How is Mahout influenced by the distance measure?

    The Euclidean Squared distance metric uses the same
equation as the Euclidean distance metric, but does not take
the square root. As a result, clustering with the Euclidean
Squared distance metric should be considerably faster than
clustering with the regular Euclidean distance. However, the
experiment shows just a small improvement.

        Distance measure     K-means (min)   Iterations
        Euclidean                  80            56
        Squared Euclidean          79            56
   Table 3. Converging times using different measures. Cluster size 5.
                           File size: 1.1 GB.

    As table 3 shows, the Squared Euclidean distance shaves
only 1 minute off the 80-minute runtime. The difference between the
distances would reduce each map and reduce time. This
evidence, together with others presented in this work,
suggests that the first phase and the communication between
the machines can be relevant for the performance of
Mahout. It also implies that the reducer may have a
significant impact on each iteration time.

                        V.     FUTURE WORK

    In the future the following influences on Mahout K-
    means will be studied:
        • bigger datasets;
        • increase in the number of tasks;
        • increase in the number of reducers;
        • one machine with n cores vs n machines with 1
             core.

                        VI.     CONCLUSIONS

    In this paper we tested how Mahout K-means scales. For
that purpose we have conducted several experiments




varying the dataset size and the number of nodes. The
influence of two different distance measures was also tested.
   We concluded that Mahout can be a promising tool for K-means
clustering: it allows the clustering of large datasets, is scalable,
and produces significant gains in performance.
   Using Mahout with small files is not always the best option:
there is an overhead due to the replication and distribution of the
data blocks. This overhead is quickly compensated as the dataset
grows.
   Increasing the number of nodes reduces the execution times.
However, for small files it can lead to an underutilization of each
machine's resources.
   The convergence times do not depend strongly on the distance
measure.
   As expected, the quality of the Mahout K-means clustering was
consistent across the several experiments.

                              REFERENCES
[1]  F. Farnstrom, J. Lewis, and C. Elkan: 'Scalability for Clustering Algorithms Revisited', SIGKDD Explorations, 2000, 2, pp. 51-57
[2]  J. Han, M. Kamber, and J. Pei: 'Data Mining: Concepts and Techniques' (Elsevier, 2006, 2nd edn.)
[3]  K.-M. Osei-Bryson: 'Assessing Cluster Quality Using Multiple Measures - A Decision Tree Based Approach', The Next Wave in Computing, Optimization, and Decision Technologies, 2005, 29, (VI), pp. 371-384
[4]  Kun Niu, C.H., Shubo Zhang, and Junliang Chen: 'ODDC: Outlier Detection Using Distance Distribution Clustering', Proc. PAKDD 2007 Workshops, 2007
[5]  V.J. Hodge and J. Austin: 'A Survey of Outlier Detection Methodologies', Artificial Intelligence Review, 2004, 22, pp. 85-126
[6]  A.K. Jain, M.N. Murty, and P.J. Flynn: 'Data Clustering: A Review', ACM Computing Surveys, 1999, 31, (3)
[7]  S.B. Kotsiantis and P.E. Pintelas: 'Recent Advances in Clustering: A Brief Survey', WSEAS Transactions on Information Science and Applications, 2004, 1, pp. 73-81
[8]  O.A. Abbas: 'Comparisons Between Data Clustering Algorithms', The International Arab Journal of Information Technology, 2008, 5, (3), pp. 320-325
[9]  B. Singla, K. Yadav, and J. Singh: 'Comparison and analysis of clustering techniques', 2008, pp. 1-3
[10] Ying Zhao and G. Karypis: 'Hierarchical Clustering Algorithms for Document Datasets', Data Mining and Knowledge Discovery, 2005, 10, pp. 141-168
[11] H. Frigui and R. Krishnapuram: 'Clustering by competitive agglomeration', Pattern Recognition, 1997, 30, (7), pp. 1109-1119
[12] P.-N. Tan, M. Steinbach, and V. Kumar: 'Introduction to Data Mining' (Addison-Wesley Longman Publishing Co., Inc., 2005, 1st edn.)
[13] C. Immaculate Mary and S.V. Kasmir Raja: 'Refinement of clusters from k-means with ant colony optimization', Journal of Theoretical and Applied Information Technology, 2009, 9, (2)
[14] J. Dean and S. Ghemawat: 'MapReduce: simplified data processing on large clusters', Commun. ACM, 2008, 51, (1), pp. 107-113
[15] T. White: 'Hadoop: The Definitive Guide' (O'Reilly Media, 2009)
[16] S. Owen, R. Anil, T. Dunning, and E. Friedman: 'Mahout in Action' (Manning Publications, 2010)
[17] B. Raskutti and C. Leckie: 'An Evaluation of Criteria for Measuring the Quality of Clusters', Proc. Sixteenth International Joint Conference on Artificial Intelligence, 1999
[18] Q.H. Nguyen and V.J. Rayward-Smith: 'Internal quality measures for clustering in metric space', Int. J. Business Intelligence and Data Mining, 2008, 3, (1)
[19] W. Shang, B. Adams, and A.E. Hassan: 'An experience report on scaling tools for mining software repositories using MapReduce', Proc. IEEE/ACM International Conference on Automated Software Engineering, Antwerp, Belgium
[20] C.T. Chu, S.K. Kim, Y.A. Lin, Y. Yu, G. Bradski, A.Y. Ng, and K. Olukotun: 'Map-Reduce for Machine Learning on Multicore', NIPS, 2006, pp. 281-288
 

Recently uploaded (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

K-means clustering in the cloud - a Mahout test

2011 Workshops of International Conference on Advanced Information Networking and Applications

K-means clustering in the cloud - a Mahout test

Rui Máximo Esteves*, Rui Pais*, Chunming Rong*
rui.esteves@uis.no, rma.pais@stud.uis.no, chunming.rong@uis.no
* Department of Electrical and Computer Engineering, University of Stavanger, Norway

978-0-7695-4338-3/11 $26.00 © 2011 IEEE. DOI 10.1109/WAINA.2011.136

Abstract— K-Means is a well-known clustering algorithm that has been successfully applied to a wide variety of problems. However, its application has usually been restricted to small datasets. Mahout is a cloud computing approach to K-Means that runs on a Hadoop system. Both Mahout and Hadoop are free and open source. Due to their inexpensive and scalable characteristics, these platforms can be a promising technology for solving data-intensive problems which were not trivial in the past. In this work we studied the performance of Mahout using a large dataset. The tests ran on Amazon EC2 instances and allowed us to compare the gain in runtime when running on a multi-node cluster. This paper presents some results of ongoing research.

Keywords- K-means, cloud computing, Mahout, MapReduce

I. INTRODUCTION

The K-means is one of the most frequently used clustering algorithms. It is simple and straightforward and has been applied successfully for many years. Under the assumption that datasets tend to be small, research on clustering algorithms has traditionally focused on improving the quality of clustering. However, many datasets are now large and cannot fit into main memory [1].

Mahout/Hadoop can be a promising and inexpensive solution for problems with large datasets. However, there is a lack of studies regarding its performance. With this work we wanted to study the gain in runtime when using this solution and also to test whether there was any loss in cluster quality. To meet those aims we used Amazon EC2 instances and set up a small cluster. We chose to study K-means since it is one of the most popular and simplest algorithms.

The rest of this paper is organized as follows: section II presents some theoretical foundations. The experiments are described in section III, and in section IV the results, interpretation and discussion are presented. Future work is in section V and the final conclusions are in section VI.

II. THEORY

Clustering

Clustering refers to the process of grouping samples into different classes based on their similarities. Samples within a class have high similarity to one another but are very dissimilar to samples in other classes. These groups or classes are called clusters. Clustering algorithms can divide the samples into clusters automatically, without any preconceptions about what kind of groupings should be found. Within the context of machine learning, clustering is considered a form of unsupervised learning, since there is no target variable to guide the learning process. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples [2]. The number of clusters, their size and their shape are in general not known in advance, and each of these parameters must be determined by either the user or the clustering algorithm [3].

Clustering has been used in many areas, including data mining, statistics, biology, and machine learning. Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases [2]. One useful application example is fault detection [4, 5]. Other examples of applications can be found in [2, 6]. The concept of clustering is not particularly new, but it is still an important topic of research. A useful survey of recent advances can be found in [7].

Cluster algorithm types

The two major types of cluster algorithms are hierarchical and partitional. The first type produces a hierarchical decomposition of the data set into partitions. It merges or splits the partitions into clusters using a similarity criterion. In the second type, an algorithm creates an initial division of the data set, according to a given number of partitions. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion for partitioning is a combination of high similarity of the samples inside clusters with high dissimilarity between distinct clusters. This means that samples in the same cluster are "close" or related to each other, whereas samples of different clusters are "far apart" or very different [2, 6].
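As a toy illustration of the merge-based (hierarchical) type, the following sketch performs single-linkage agglomerative merging on one-dimensional points. This is our own example, not code from the paper; the stopping rule and linkage choice are assumptions for illustration only.

```python
# Toy hierarchical clustering: repeatedly merge the two closest clusters,
# where cluster distance is the smallest gap between any pair of members
# (single linkage).
def single_linkage_merge(clusters):
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    merged = clusters[i] + clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

clusters = [[1.0], [1.2], [8.0], [8.3]]   # start with every point in its own cluster
while len(clusters) > 2:                  # stop at the desired number of clusters
    clusters = single_linkage_merge(clusters)
print(sorted(sorted(c) for c in clusters))  # → [[1.0, 1.2], [8.0, 8.3]]
```

Cutting the merge process at a chosen level is what yields the dendrogram's partitions mentioned above.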
Each clustering type is better suited to a particular type of problem. Abbas [8] concluded that partitional algorithms are suited for large data sets while hierarchical ones are for small data sets. Singla, Yadav and Singh [9] presented some arguments in favor of partitional algorithms. According to them, there is no need to determine the full dendrogram with a partitional algorithm. They also found that the time complexity of partitional algorithms is small compared with hierarchical ones. Zhao and Karypis [10] had similar results. On the other hand, hierarchical algorithms have the advantages of an easily interpreted graphical representation and insensitivity to initialization and local minima [11].

K-means

K-means clustering is a widely used partitional algorithm. It partitions the samples into clusters by minimizing a measure between the samples and the centroids of the clusters. Euclidean distance is a similarity criterion extensively used for samples in Euclidean space. Manhattan distance is a simpler measure, also for Euclidean space. Cosine and Jaccard distances are often employed for document clustering [12].

The algorithm for the standard K-means is given as follows [13]:
1. Choose a number of clusters k.
2. Initialize centroids c1, ..., ck (the centroid is the center of the cluster).
3. For each data point, compute the centroid it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute the centroids (mean of the data points in each cluster).
5. Stop when there are no new re-assignments.

K-means clustering is simple, but it has high time complexity when the data sets are large. In these circumstances the memory of a single machine can be a restriction. As a consequence it has not been used in the past with large data sets [13].

Hadoop/Mahout solution

Hadoop (http://hadoop.apache.org/) is a software framework which allows the effortless development of cloud computing systems. It supports data-intensive distributed applications on large clusters built of commodity hardware. The framework is designed to be scalable, which allows the user to add more nodes as the application requires. Hadoop uses a distributed computing paradigm named MapReduce.

The MapReduce programming model consists of two user-defined functions, map and reduce, specified on a job. The job usually splits the input dataset into independent blocks which are processed by the map tasks in parallel. Hadoop sorts the outputs of the maps, which then become the input to be processed by the reduce tasks [14]. Hadoop uses a distributed file system, HDFS, which creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations [15].

Mahout (http://mahout.apache.org/) is a machine learning library that runs over a Hadoop system. It has a collection of algorithms to solve clustering, classification and prediction problems. It uses the MapReduce paradigm, which in combination with Hadoop can be used as an inexpensive solution to machine learning problems. Mahout is a project still in development, which has so far been used mainly for recommendation engines, document classifiers and other typical web problems. At the moment, its usage for clustering is not sufficiently explored. Hadoop and Mahout are free and open source projects. Due to their inexpensive and scalable characteristics, these platforms can be a promising technology for solving data-intensive problems which were not trivial in the past. However, it is important to study the tradeoff between the overhead of using MapReduce and the gain in performance from distributing the computation. It is also essential to check whether the clusters maintain their quality.

Mahout contains various implementations of clustering, like K-means, fuzzy K-means, mean shift and Dirichlet, among others. To input data for Mahout clustering some procedures are necessary first. If the data is not numerical it has to be preprocessed. Vectors must then be created. If the data set is sparse, the user can create sparse vectors, which are much more compact. The vectors are finally converted to a specific Hadoop file format, SequenceFile.

The K-means clustering algorithm takes the following input parameters:
- A SequenceFile containing the input vectors.
- A SequenceFile containing the initial cluster centers. If not present, they will be attributed randomly.
- The similarity measure to be used.
- The convergence threshold, which is the stopping condition of the K-means: if in a particular iteration no cluster center moves beyond this threshold, no further iterations are done.
- The maximum number of iterations to be processed if the algorithm does not converge first.
- The number of reducers to be used. This value determines the parallelism of the execution.
- The vector implementation used for the input files.

As output the user gets the centroid coordinates and the samples attributed to each cluster. The output files are in SequenceFile format. Mahout provides the necessary tools for file conversion and for creating the vectors [16].

There are three stages in a K-means job (figure 1):
- Initial stage: the segmentation of the dataset into HDFS blocks, and their replication and transfer to other machines; according to the number of blocks and the cluster configuration, the necessary tasks are assigned and distributed.
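The five standard K-means steps listed above can be sketched in plain Python. This is a minimal single-machine illustration of the algorithm, not Mahout's distributed implementation; the data and seed are our own.

```python
# Steps: (1) choose k, (2) initialize centroids, (3) assign each point to
# its closest centroid, (4) recompute centroids as cluster means,
# (5) stop when nothing changes.
import random

def kmeans(points, k, max_iter=100):
    centroids = random.sample(points, k)                 # step 2
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # step 3
            i = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centroids[i]    # step 4
               for i, c in enumerate(clusters)]
        if new == centroids:                             # step 5
            break
        centroids = new
    return centroids, clusters

random.seed(0)
points = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
centroids, clusters = kmeans(points, k=2)
print(sorted(round(c, 2) for c in centroids))  # → [1.0, 5.0]
```

Step 3 is what Mahout's map stage parallelizes, and step 4 is what its reduce stage computes, as described in the next section.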
Figure 1 – The three stages of a K-means job (initial, map and reduce stages).

- Map stage: calculates the distances between samples and centroids, matches each sample with the nearest centroid and assigns it to that specific cluster. Each map task processes one data block.
- Reduce stage: recalculates each centroid using the average of the coordinates of all the points in that cluster. The associated points are averaged out to produce the new location of the centroid. The centroid configuration is fed back to the mappers. The loop ends when the centroids converge.

After finishing the job, Mahout provides the K-means time, the centroids and the clustered samples.

Quality of clustering

An important problem in clustering is how to decide what the best set of clusters is for a given data set, in terms of both the number of clusters and the membership of those clusters [17]. One simple inter-cluster quality measure was used in this work: the sum of the squared distances between centroids (Q1) [18]:

    Q1 = sum over i=1..k, j=i+1..k of d(c_i, c_j)^2        (eq. 1)

where c_i is the centroid of cluster i, d is the distance metric, and k is the number of clusters.

In our study, we do not aim to reach the optimal cluster configuration. Instead, we want to check whether the quality of the Mahout K-means is maintained between the tests. We therefore chose this measure, since it is not computationally expensive and is easy to implement.

Related work

At this date we could not find any intensive tests of Mahout in the literature. However, there are two interesting works on scaling data mining tools using MapReduce [19, 20].

III. EXPERIMENTS

The dataset

To evaluate the performance of multi-node versus single-node clusters we used a dataset from the 1999 KDD Cup (http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html). This dataset was gathered by Lincoln Labs: nine weeks of raw TCP dump data were collected from a typical US Air Force LAN. During the use of the LAN, several attacks were performed on it. The raw packet data were then aggregated into connection data. Per record, extra features were derived based on domain knowledge about network attacks. There are 38 different attack types, belonging to 4 main categories. The dataset has 4,940,000 records, with 41 attributes and 1 label (converted to numerical), totaling 1.1 GB. This file was randomly segmented into smaller files. For the data preprocessing we developed Java software.

The machines

We ran our tests using Amazon EC2 m1.large instances with the following characteristics: 7.5 GB memory; 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each); 850 GB of storage; 64-bit platform; high I/O performance. All the instances were launched using the same Amazon Machine Image (AMI). We adapted an existing Cloudera AMI (cloudera-hadoop-ubuntu-20090623_x86_64) to our specific needs and stored it on Amazon S3. Our AMI uses Mahout 0.4 and Hadoop 0.20.2+237, and we added scripts for automatic cluster configuration. This way the user just needs to launch EC2 instances with this AMI and run a script on the master node, and the cluster becomes ready for use. The HDFS configuration used was the default one, where the block size is 64 MB and the replication factor is 3.

Jobs

To study how the performance changes with scale, the K-means jobs were set up to guarantee that the algorithm converges. K-means can produce different results depending on the position of the initial centroids. To avoid this issue, the same initial centroids were used for all experiments.
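Under our reading of eq. 1 as a sum over all centroid pairs, Q1 can be computed with a few lines of Python. This is an illustrative sketch with made-up centroids, not the authors' code.

```python
# Q1: sum of squared distances between all pairs of cluster centroids.
import math

def q1(centroids):
    k = len(centroids)
    return sum(
        math.dist(centroids[i], centroids[j]) ** 2
        for i in range(k) for j in range(i + 1, k)
    )

# Runs that produce the same centroids give the same Q1, which is how the
# experiments check that clustering quality is maintained between tests.
run_a = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
print(q1(run_a))  # → 25 + 100 + 25 = 150.0
```

Note the measure looks only at the centroids, which is why it is cheap: it never touches the 4.9 million samples.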
Each test was repeated 3 times to ensure the results were not affected by Amazon fluctuations. The Mahout K-means algorithm was executed using the following parameters: Euclidean distance as the similarity measure in experiments A and B, and Squared Euclidean in C. All experiments were set with a maximum of 150 iterations and 11 clusters. To verify that the quality of the Mahout K-means was maintained, equation 1 was applied.

IV. RESULTS AND INTERPRETATION

A – How does Mahout scale with the file size?

It can be observed in table 1 that the 5-node cluster performs worse than the single node for small datasets. With a 66 MB file (6%) the distribution process is not compensated by the overhead of the initial step. Given that the HDFS block size is 64 MB, only two tasks will be executed. Since each machine is dual-core, no gain is expected for a file smaller than 128 MB. The gain is already present for a file slightly bigger than 128 MB (the 12% scenario). The overhead is largely compensated as soon as the dataset grows. The gain in performance reaches 351% for the 1.1 GB file.

Data file (%) | K-means MN (sec) | K-means SN (sec) | Iterations to converge | Gain (%)
6             | 43.9             | 41.3             | 72                     | -6%
12            | 45.3             | 52.8             | 78                     | 16%
25            | 46.4             | 88.5             | 72                     | 91%
50            | 48.6             | 149.1            | 71                     | 207%
100           | 70.3             | 316.7            | 56                     | 351%

Table 1. Times per iteration and gain with the file size. MN - 5-node cluster. SN - single node.

Figure 2. Total convergence times for the 5-node cluster and the single node. File size: 1.1 GB.

Figure 2 shows that when the data file is increased, the 5-node time grows very slowly. This seems to indicate that the initial step has a considerable footprint. For this specific dataset and set of fixed initial centroids, the K-means needed fewer iterations to converge as the file became bigger. Since k is the same, this is expected behavior: given that there are more samples per cluster, the centroid averages are more stable and the algorithm converges in fewer iterations. However, each iteration is computationally more demanding. On the other hand, more iterations mean more communication between the mappers and the reducers. On a cluster this implies more network usage, which can become a bottleneck.

B – How does Mahout scale with the number of nodes?

Mahout decreases the clustering time as the number of nodes increases. For 20 nodes the time is reduced to less than one fifth. In figure 3, it can be noticed that the clustering time diminishes with the number of nodes in a non-linear way.

Figure 3. Runtimes per iteration with the number of nodes. File size: 1.1 GB.
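The Gain column of table 1 can be reproduced from the per-iteration times, assuming gain (%) = 100 * (SN/MN - 1), i.e. how much slower the single node is than the 5-node cluster. The sanity check below (our own reading, not the authors' code) confirms the reported values to within one percentage point of rounding.

```python
# Recompute table 1's Gain column from the MN/SN per-iteration times.
rows = [   # (% of file, MN seconds, SN seconds, reported gain %)
    (6, 43.9, 41.3, -6), (12, 45.3, 52.8, 16), (25, 46.4, 88.5, 91),
    (50, 48.6, 149.1, 207), (100, 70.3, 316.7, 351),
]
for pct, mn, sn, reported in rows:
    gain = 100 * (sn / mn - 1)
    # each reported value matches the per-iteration times to within 1 point
    assert abs(gain - reported) <= 1, (pct, gain)
print("table 1 gains consistent")
```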
One can verify in table 2 that the gain is almost 100% between 1 and 2 nodes, yet only 444% between 1 and 20 nodes. This happens due to the size of the dataset, which seems to be too small for 20 nodes. The file is divided into 18 blocks, and 2 idle machines were noticed in the 20-node cluster. For this dataset a possible improvement could be forcing a higher number of tasks.

Nodes | K-means (min) | Gain (%)
1     | 296           | -
2     | 149           | 98%
5     | 80            | 271%
10    | 66            | 351%
20    | 54            | 444%

Table 2. Converging times varying the number of nodes. File size: 1.1 GB. Gain compared with 1 node.

It is interesting to notice that when the number of nodes increases, the CPU usage diminishes and the network usage grows. This can be clearly seen by comparing figures 4 and 5, and figures 6 and 7. These graphs were obtained from Amazon CloudWatch and correspond to the monitoring of a slave node. They compare a 5-node and a 10-node cluster executing K-means on the full dataset. The CPU idle time in the 10-node scenario is noticeable. It indicates that the instance was waiting for the results from the other nodes. The network was continually used at around 84 Mbps in this scenario, while it was interrupted on the 5-node cluster. Extremely high peaks were present when the cluster was not executing a job; as such, they should be related to Amazon and not to the experiment itself. Both network graphs show how intensive the network usage was. They also point towards a saturation of the network.

Figure 4. CPU utilization per slave node (x axis represents the time when the experiment occurred). 5 nodes. File size: 1.1 GB.
Figure 5. CPU utilization per slave node. 10 nodes. File size: 1.1 GB.
Figure 6. Network usage per node. 5 nodes. File size: 1.1 GB.
Figure 7. Network usage per node. 10 nodes. File size: 1.1 GB.

C – How is Mahout influenced by the distance measure?

The Squared Euclidean distance metric uses the same equation as the Euclidean distance metric, but does not take the square root. As a result, clustering with the Squared Euclidean distance metric should be considerably faster than clustering with the regular Euclidean distance. However, the experiment shows just a small improvement.

Distance measure  | K-means (min) | Iterations
Euclidean         | 80            | 56
Squared Euclidean | 79            | 56

Table 3. Converging times using different measures. Cluster size: 5 nodes. File size: 1.1 GB.

As table 3 shows, Squared Euclidean reduces the total time by only 1 minute, from 80 to 79. Avoiding the square root should reduce each map and reduce time. This evidence, together with others presented in this work, suggests that the initial phase and the communication between machines can be relevant for the performance of Mahout. It also implies that the reducer may have a significant impact on each iteration time.
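The reason the two metrics can be swapped without changing the clusters is that squaring is monotonic for non-negative distances: the nearest centroid is the same either way, and only the square-root computation per distance is saved. A quick check on toy data of our own:

```python
# Nearest-centroid assignment is identical under Euclidean and Squared
# Euclidean distance, since x < y implies x^2 < y^2 for x, y >= 0.
import math

points = [(1.0, 2.0), (4.0, 0.5), (3.0, 3.0), (0.5, 0.5)]
centroids = [(1.0, 1.0), (4.0, 4.0)]

def nearest(p, dist):
    return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))

euclid = lambda p, c: math.dist(p, c)
squared = lambda p, c: (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

assignments = [(nearest(p, euclid), nearest(p, squared)) for p in points]
print(all(a == b for a, b in assignments))  # → True
```

This is consistent with table 3: the clusters (and iteration count) are unchanged, so the saving is limited to the arithmetic inside each map task.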
V. FUTURE WORK

In the future, the following influences on Mahout K-means will be studied:
• bigger datasets;
• an increase in the number of tasks;
• an increase in the number of reducers;
• one machine with n cores vs. n machines with 1 core.

VI. CONCLUSIONS

In this paper we tested how Mahout K-means scales. For that purpose we conducted several experiments varying the dataset size and the number of nodes. The influence of two different distance measures was also tested. We concluded that Mahout can be a promising tool for K-means clustering. It allows the clustering of large datasets, is scalable, and produces significant gains in performance.

Using Mahout with small files is not always the best option. There is an overhead due to the replication and distribution of the data blocks. The overhead is largely compensated as soon as the dataset grows. Increasing the number of nodes reduces the execution times. However, for small files it can lead to an underutilization of each machine's resources. The converging times do not depend strongly on the distance measure. As expected, the quality of the Mahout K-means clustering was consistent across the several experiments.

REFERENCES

[1] Farnstrom, F., Lewis, J., and Elkan, C.: 'Scalability for Clustering Algorithms Revisited', SIGKDD Explorations, 2000, 2, pp. 51-57
[2] Han, J., Kamber, M., and Pei, J.: 'Data Mining: Concepts and Techniques' (Elsevier, 2006, 2nd edn.)
[3] Osei-Bryson, K.-M.: 'Assessing Cluster Quality Using Multiple Measures - A Decision Tree Based Approach', The Next Wave in Computing, Optimization, and Decision Technologies, 2005, 29, (VI), pp. 371-384
[4] Kun Niu, C.H., Shubo Zhang, and Junliang Chen: 'ODDC: Outlier Detection Using Distance Distribution Clustering', Proc. PAKDD 2007 Workshops, 2007
[5] Hodge, V.J., and Austin, J.: 'A Survey of Outlier Detection Methodologies', Artificial Intelligence Review, 2004, 22, pp. 85-126
[6] Jain, A.K., Murty, M.N., and Flynn, P.J.: 'Data Clustering: A Review', ACM Computing Surveys, 1999, 31, (3)
[7] Kotsiantis, S.B., and Pintelas, P.E.: 'Recent Advances in Clustering: A Brief Survey', WSEAS Transactions on Information Science and Applications, 2004, 1, pp. 73-81
[8] Abbas, O.A.: 'Comparisons Between Data Clustering Algorithms', The International Arab Journal of Information Technology, 2008, 5, (3), pp. 320-325
[9] Singla, B., Yadav, K., and Singh, J.: 'Comparison and analysis of clustering techniques', 2008, pp. 1-3
[10] Zhao, Y., and Karypis, G.: 'Hierarchical Clustering Algorithms for Document Datasets', Data Mining and Knowledge Discovery, 2005, 10, pp. 141-168
[11] Frigui, H., and Krishnapuram, R.: 'Clustering by competitive agglomeration', Pattern Recognition, 1997, 30, (7), pp. 1109-1119
[12] Tan, P.-N., Steinbach, M., and Kumar, V.: 'Introduction to Data Mining' (Addison-Wesley, 2005, 1st edn.)
[13] Immaculate Mary, C., and Kasmir Raja, S.V.: 'Refinement of clusters from k-means with ant colony optimization', Journal of Theoretical and Applied Information Technology, 2009, 9, (2)
[14] Dean, J., and Ghemawat, S.: 'MapReduce: simplified data processing on large clusters', Commun. ACM, 2008, 51, (1), pp. 107-113
[15] White, T.: 'Hadoop: The Definitive Guide' (O'Reilly Media, 2009)
[16] Owen, S., Anil, R., Dunning, T., and Friedman, E.: 'Mahout in Action' (Manning Publications, 2010)
[17] Raskutti, B., and Leckie, C.: 'An Evaluation of Criteria for Measuring the Quality of Clusters', Proc. Sixteenth International Joint Conference on Artificial Intelligence, 1999
[18] Nguyen, Q.H., and Rayward-Smith, V.J.: 'Internal quality measures for clustering in metric space', Int. J. Business Intelligence and Data Mining, 2008, 3, (1)
[19] Shang, W., Adams, B., and Hassan, A.E.: 'An experience report on scaling tools for mining software repositories using MapReduce', Proc. IEEE/ACM International Conference on Automated Software Engineering, Antwerp, Belgium
[20] Chu, C.T., Kim, S.K., Lin, Y.A., Yu, Y., Bradski, G.R., Ng, A.Y., and Olukotun, K.: 'Map-Reduce for Machine Learning on Multicore', NIPS, 2006, pp. 281-288