Project 0th Review


Published in: Education

  1. A Combined Approach for Clustering based on the GSA-KM and Genetic Algorithms
     Divakar Raj.M (0901016), Dilip.M (0901015), Kishore Kumar.C (0901036), IV CSE - A
     Under the guidance of Mr.P.Perumal, Associate Professor
     Department of Computer Science and Engineering (UG)
     Data Mining / Clustering
  2. Introduction to Data Mining
     • Data mining (knowledge discovery in databases):
       – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
     • Potential applications
       – Market analysis and management
       – Risk analysis and management
       – Fraud detection and management
       – Text mining (news group, email, documents) and Web analysis
       – Intelligent query answering
  3. Data Mining: A KDD Process
     – Data mining: the core of the knowledge discovery process.
     [Figure: KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
  4. Architecture of a Typical Data Mining System
     [Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Knowledge-base; Database or data warehouse server; Data cleaning & data integration, Filtering; Databases and Data Warehouse]
  5. Data Mining Functionalities
     • Concept description: characterization and discrimination
       – Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
     • Association (correlation and causality)
       – Multi-dimensional vs. single-dimensional association
       – age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”)
       – contains(T, “computer”) → contains(T, “software”)
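Association rules like the two above are commonly scored by support and confidence. The slide does not name these measures, so treat them here as an assumption about standard practice. The sketch below computes both for a rule such as contains(T, “computer”) → contains(T, “software”); the transactions are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0


# Hypothetical toy data, not taken from the slides.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
```

A rule is usually kept only when both values exceed thresholds chosen by the analyst.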
  6. Data Mining Functionalities
     • Classification and prediction
       – Finding models (functions) that describe and distinguish classes or concepts for future prediction
       – E.g., classify countries based on climate, or classify cars based on gas mileage
       – Presentation: decision tree, classification rule, neural network
       – Prediction: predict some unknown or missing numerical values
     • Cluster analysis
       – Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
       – Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
  7. Data Mining Functionalities
     • Outlier analysis
       – Outlier: a data object that does not comply with the general behavior of the data
       – It can be considered noise or an exception, but is quite useful in fraud detection and rare-event analysis
     • Trend and evolution analysis
       – Trend and deviation: regression analysis
       – Sequential pattern mining, periodicity analysis
       – Similarity-based analysis
  8. Issues in Data Mining
     • Individual privacy
     • Data integrity
     • Relational database structure vs. multidimensional one
     • Issue of cost
     • Mining methodology and user interaction issues
     • Performance issues
     • Issues relating to the diversity of database types
  9. Applications
     • Database analysis and decision support
       – Market analysis and management
         • Target marketing, customer relation management, market basket analysis, cross selling, market segmentation
       – Risk analysis and management
         • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  10. Applications
      • Text mining (news group, email, documents) and Web analysis
      • Intelligent query answering
      • Sports
      • Astronomy
      • Internet Web Surf-Aid
  11. Clustering
      • Clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions
      • The result is a set of meaningful subclasses called clusters
  12. Cluster Analysis
      • Cluster: a collection of data objects
        – Similar to one another within the same cluster
        – Dissimilar to the objects in other clusters
      • Cluster analysis
        – Grouping a set of data objects into clusters
      • Clustering is unsupervised classification: no predefined classes
      • Typical applications
        – As a stand-alone tool to get insight into data distribution
        – As a preprocessing step for other algorithms
  13. What Is Good Clustering?
      • A good clustering method will produce high-quality clusters with
        – high intra-class similarity
        – low inter-class similarity
      • The quality of a clustering result depends on both the similarity measure used by the method and its implementation
      • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
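One way to make "high intra-class similarity, low inter-class similarity" concrete is to measure how tightly points sit around their own centroid and how far apart the centroids are. The sketch below assumes Euclidean distance and uses the within-cluster sum of squared errors and the minimum centroid separation as stand-ins for the two criteria; the slides do not prescribe these particular measures, and the sample points are invented.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def within_cluster_sse(clusters):
    """Sum of squared distances from each point to its own cluster centroid.

    Lower values mean tighter clusters (higher intra-class similarity).
    """
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(math.dist(p, c) ** 2 for p in points)
    return total


def min_centroid_separation(clusters):
    """Smallest distance between any two cluster centroids.

    Higher values mean better separated clusters (lower inter-class similarity).
    """
    cs = [centroid(points) for points in clusters]
    return min(math.dist(a, b) for i, a in enumerate(cs) for b in cs[i + 1:])


# Two groupings of the same four hypothetical points.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
```

On this data the first grouping has both a lower error and a larger separation than the second, which matches the intuition on the slide.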
  14. Requirements of Clustering in Data Mining
      • Scalability
      • Ability to deal with different types of attributes
      • Discovery of clusters with arbitrary shape
      • Minimal requirements for domain knowledge to determine input parameters
      • Ability to deal with noise and outliers
      • Insensitivity to the order of input records
      • Ability to handle high dimensionality
      • Incorporation of user-specified constraints
      • Interpretability and usability
  15. Major Clustering Approaches
      • Partitioning algorithms: construct various partitions and then evaluate them by some criterion
      • Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
      • Density-based: based on connectivity and density functions
      • Grid-based: based on a multiple-level granularity structure
      • Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model
  16. Issues of Clustering
      • Assessment of results
      • Choice of appropriate number of clusters
      • Data preparation
      • Proximity measures
      • Handling outliers
  17. General Applications of Clustering
      • Pattern recognition
      • Image processing
      • Economic science (especially market research)
      • WWW
        – Document classification
        – Cluster Weblog data to discover groups of similar access patterns
  18. Examples of Clustering Applications
      • Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
      • Land use: identification of areas of similar land use in an earth observation database
      • Insurance: identifying groups of motor insurance policy holders with a high average claim cost
      • City planning: identifying groups of houses according to their house type, value, and geographical location
      • Earthquake studies: observed earthquake epicenters should be clustered along continental faults
  19. Literature Survey
      [1] An Architecture for Component-Based Design of Representative-Based Clustering Algorithms. Boris Delibas, Milan Vuki, Milos Jovanovi, Kathrin Kirchner, Johannes Ruhland, Milija Suknovic (2012)
      [2] The Research of Imbalanced Data Set of Sample Sampling Method Based on K-Means Cluster and Genetic Algorithm. Yang Yong (2012)
      [3] A Combined Approach for Clustering based on K-means and Gravitational Search Algorithms. Abdolreza Hatamlou, Salwani Abdullah, Hossein Nezamabadi-pour (2012)
  20. An Architecture for Component-Based Design of Representative-Based Clustering Algorithms
      • Based on reusable components
      • Components are derived from K-Means-like algorithms and their extensions
      • A new algorithm is built by exchanging components of the original algorithm with their improvements
      • Comparison and evaluation are possible within the representative-based clustering framework
  21. The Research of Imbalanced Data Set of Sample Sampling Method Based on K-Means Cluster and Genetic Algorithm
      • K-Means is used to cluster the data; within each cluster, GA is used to carry out validation and to generate a new sample
      • Improves classification performance on imbalanced data sets
      • Generates samples for the minority class of the imbalanced data set
      • Focuses on classification accuracy for the minority classes
  22. A Combined Approach for Clustering based on K-means and Gravitational Search Algorithms
      • A hybrid data clustering algorithm based on GSA and k-means (GSA-KM) is presented
      • It uses the advantages of both algorithms
      • Comparison of the performance of GSA-KM with other well-known algorithms
        – K-means
        – Genetic Algorithm (GA)
        – Simulated Annealing (SA)
        – Ant Colony Optimization (ACO)
        – Honey Bee Mating Optimization (HBMO)
        – Particle Swarm Optimization (PSO)
        – Gravitational Search Algorithm (GSA)
      • Comparison based on real and standard datasets from the UCI repository
  23. Existing System: K-Means
      • One of the most efficient and famous clustering algorithms
      • Starts with some random or heuristic-based centroids for the desired clusters
      • Assigns every data object to the closest centroid
      • Iteratively refines the current centroids to reach the (near) optimal ones by calculating the mean value of data objects within their respective clusters
      • The algorithm terminates when any one of the specified termination criteria is met (i.e., a predetermined maximum number of iterations is reached, a (near) optimal solution is found, or the maximum search time is reached)
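The steps listed on this slide can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes Euclidean distance, random initial centroids drawn from the data, and only two of the termination criteria (an iteration cap and no change in centroids). The function name and parameters are chosen here for illustration.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length tuples.

    Returns (centroids, clusters). If the iteration cap is hit before the
    centroids stop moving, `clusters` reflects the last assignment step.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids from the data
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: every point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # termination: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, k=2)
```

Changing `seed` changes the starting centroids, which is a simple way to observe how much the result can depend on initialization for less well separated data.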
  24. Existing System: Gravitational Search Algorithm
      • Inspired by the physical phenomenon of gravity
      • Based on the interaction of masses in the universe via the Newtonian law of gravity
      • Attraction depends on the masses and the distance between them
      • $F = G \frac{M_1 M_2}{R^2}$
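The force law on this slide, together with the idea that better solutions behave like heavier masses, can be illustrated as follows. This is a sketch under assumptions: `gravitational_force` evaluates the scalar Newtonian formula as written on the slide, and `normalized_masses` uses a min-max mapping from fitness to mass for a minimization problem. Both function names are hypothetical, and the full GSA update (velocities, a time-varying G) is not shown.

```python
def gravitational_force(m1, m2, r, g=1.0):
    """Newtonian attraction F = G * m1 * m2 / r**2 for scalar inputs."""
    if r <= 0:
        raise ValueError("distance must be positive")
    return g * m1 * m2 / (r ** 2)


def normalized_masses(fitness):
    """Map fitness values (lower is better) to masses that sum to 1.

    The best solution gets the largest mass and the worst gets mass 0.
    """
    best, worst = min(fitness), max(fitness)
    if best == worst:  # all solutions equally good
        return [1.0 / len(fitness)] * len(fitness)
    raw = [(worst - f) / (worst - best) for f in fitness]
    total = sum(raw)
    return [m / total for m in raw]


# Hypothetical fitness values for three candidate solutions.
masses = normalized_masses([1.0, 2.0, 3.0])
force = gravitational_force(masses[0], masses[1], r=2.0)
```

With this mapping, heavier (better) candidates pull the others toward them more strongly, which is the intuition behind using gravity as a search heuristic.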
  25. Drawbacks of Existing System: K-Means
      • Performance is highly dependent on the initial state of the centroids
      • May converge to a local optimum rather than the global optimum
      • The number of clusters is needed as input to the algorithm, i.e., the number of clusters is assumed to be known
  26. GSA-KM
      • Built on three main steps
        1. GSA-KM applies the k-means algorithm to the selected dataset and tries to produce near-optimal centroids for the desired clusters
        2. The proposed approach produces an initial population of solutions
        3. The GSA algorithm is applied to this population
  27. GSA-KM: Ways of Producing an Initial Population
      • One of the candidate solutions is the output of the k-means algorithm obtained in the previous step
      • Three of them are created based on the dataset itself, and the other solutions are produced randomly
      • GSA is then employed to determine an optimal solution for the clustering problem
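The population recipe on this slide (one candidate from k-means, three from the dataset, the rest random) might be sketched as below. The slide does not say how the three dataset-based candidates are built, so this sketch assumes each one is simply k randomly chosen data points. The function name, parameters, and the bounding-box sampling for the random candidates are also assumptions.

```python
import random


def initial_population(points, k, kmeans_centroids, size, seed=0):
    """Build `size` candidate solutions; each is a list of k centroids."""
    if size < 4:
        raise ValueError("need room for 1 k-means and 3 data-based candidates")
    rng = random.Random(seed)
    dim = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]

    # 1) One candidate is the k-means output from the previous step.
    population = [list(kmeans_centroids)]
    # 2) Three candidates come from the dataset itself
    #    (assumption: k randomly chosen data points each).
    for _ in range(3):
        population.append(rng.sample(points, k))
    # 3) The remaining candidates are random points inside the data's bounding box.
    while len(population) < size:
        population.append(
            [tuple(rng.uniform(lo[d], hi[d]) for d in range(dim)) for _ in range(k)]
        )
    return population


# Hypothetical data and k-means result.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
population = initial_population(points, k=2,
                                kmeans_centroids=[(0.5, 0.5), (10.5, 10.5)],
                                size=10)
```

Seeding the population with the k-means result gives GSA at least one good starting candidate, while the random candidates keep the search from collapsing onto it immediately.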
  28. Reasons for Efficiency
      • Decreases the number of iterations and function evaluations needed to find a near-global optimum, compared to the original GSA alone
      • With a good candidate solution in the initial population, GSA can search for near-global optima in a promising region of the search space and, therefore, find a higher-quality solution than the original GSA alone
  29. Proposed System
      • Along with the given GSA-KM, we intend to implement a Genetic Algorithm to further increase the efficiency and speed of the clustering
      • The proposed system will combine the advantages of both and is expected to be faster and more efficient than the traditional clustering algorithms and GSA-KM alone
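The slides do not say how the Genetic Algorithm will be combined with GSA-KM. Purely as an illustration of what GA operators look like on centroid-based candidates, the sketch below shows single-point crossover and Gaussian mutation. The operator choices, function names, and default rates are all assumptions and not the authors' design.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Single-point crossover on two equal-length lists of centroids."""
    if len(parent_a) < 2:
        return list(parent_a), list(parent_b)
    cut = rng.randint(1, len(parent_a) - 1)  # cut point between centroids
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2


def mutate(candidate, rng, rate=0.1, scale=0.5):
    """Add Gaussian noise to each coordinate with probability `rate`."""
    return [
        tuple(x + rng.gauss(0.0, scale) if rng.random() < rate else x for x in c)
        for c in candidate
    ]


# Hypothetical parents, each holding three 2-D centroids.
rng = random.Random(1)
parent_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
parent_b = [(9.0, 9.0), (8.0, 8.0), (7.0, 7.0)]
child_1, child_2 = crossover(parent_a, parent_b, rng)
mutated = mutate(child_1, rng)
```

In a full GA these operators would be applied to candidates selected by fitness, for example the within-cluster error of the clustering each candidate induces.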
  30. Implementation Details
      • Programming language: C#
      • Database: MS Access
      • The given repository is clustered using K-Means and GSA, together called GSA-KM, and a Genetic Algorithm is used to enhance the performance
      • The performance is measured and compared with other clustering algorithms
  31. References
      [1] C.L. Blake, C.J. Merz. UCI repository of machine learning databases.
      [2] S. Das, A. Abraham, A. Konar. Meta heuristic pattern clustering: an overview. Studies in Computational Intelligence (2009)
      [3] L. Kaufman, P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York (1990)
      [4] M.B. Adil. Modified global k-means algorithm for minimum sum-of-squares clustering problems. Pattern Recognition 41 (10) (2008)
      [5] E. Rashedi, H. Nezamabadi-pour, S. Saryazdi. GSA: a gravitational search algorithm. Information Sciences 179 (13) (2009)
  32. References
      [6] A. Likas, N. Vlassis, J.J. Verbeek. The global k-means clustering algorithm. Pattern Recognition 36 (2) (2003)
      [7] M. Mahdavi. Novel meta-heuristic algorithms for clustering web documents. Applied Mathematics and Computation (2008)
      [8] M. Moshtaghi. Clustering ellipses for anomaly detection. Pattern Recognition 44 (2008)
      [9] B. Saglam, et al. A mixed-integer programming approach to the clustering problem with an application in customer segmentation. European Journal of Operational Research 173 (3) (2006)
      [10] A.K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31 (8) (2010)
  33. Thank You!