A new clustering tool for Data Mining
RAPID MINER
Introduction To Clustering
 Unsupervised learning applies when historical data with class
  labels is not available, e.g. when introducing a new
  product.
 Group (cluster) existing customers based on the time
  series of their payment history, so that similar
  customers fall in the same cluster.
 Key requirement: a good measure of similarity
  between instances.
 Identify micro-markets and develop policies for
  each.
About The Project
 The aim of this project is to devise a new clustering
  algorithm for data mining.
 The main functions implemented in
  the system are preprocessing and clustering.
 In preprocessing, an .xls file can be chosen as the input
  file. Null values, if any, present in the input file
  are removed to avoid faulty results in the output
  data sets. Redundant or duplicate records in the data
  are also removed.
 In clustering, the data is distributed into groups so that
  the degree of association is strong between members of
  the same cluster and weak between members of different
  clusters.
Present Tool: Weka
 Weka (Waikato Environment for Knowledge Analysis) is a popular
  suite of machine learning software written in Java, developed at
  the University of Waikato, New Zealand.
 The Explorer interface features several panels providing access to the
  main components of the workbench:
 The Preprocess panel has facilities for importing data from a database,
  a CSV file, etc., and for preprocessing this data using a so-called
  filtering algorithm. These filters can be used to transform the data (e.g.,
  turning numeric attributes into discrete ones) and make it possible to
  delete instances and attributes according to specific criteria.
 The Cluster panel gives access to the clustering techniques in Weka,
  e.g., the simple k-means algorithm. There is also an implementation of
  the expectation maximization algorithm for learning a mixture
  of normal distributions.
Our tool:
 In the data preprocessing phase, an MS-Excel file is taken as
  input; CSV or ARFF files are not required. Excel files were chosen
  because they are well known and comfortably handled by non-technical
  people, whereas CSV and ARFF files require the user to be well versed
  in those formats. Excel support was added by importing a new library,
  ‘jxl.jar’, into the project.
 The input file is first cleaned by removing its null data sets.
  Null data sets are data sets that contain no
  information, or fewer values than a threshold (the minimum
  number of values of required attributes). The number of null
  data sets is reported to the user. Next,
  redundant/duplicate data sets are removed from
  the file. A data set is redundant/duplicate if
  all of its attribute values are the same as those of some other data set.
  These data sets are excluded from further data mining, and the
  number of redundant/duplicate data sets is also reported to the
  user.
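The two cleaning passes described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `clean_rows`, the `min_values` threshold parameter and the in-memory row format are assumptions, and reading the Excel file itself (done with ‘jxl.jar’ in the project) is left out.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[Optional[str], ...]


def clean_rows(
    rows: Sequence[Sequence[Optional[str]]], min_values: int
) -> Tuple[List[Row], int, int]:
    """Drop null rows, then duplicate rows.

    Returns (kept_rows, null_count, duplicate_count).
    `min_values` is a hypothetical threshold: a row with fewer
    non-empty cells than this is treated as a null data set.
    """
    kept: List[Row] = []
    seen = set()
    null_count = 0
    duplicate_count = 0
    for raw in rows:
        # Treat None and blank strings as missing cells.
        row = tuple(
            None if cell is None or str(cell).strip() == "" else str(cell).strip()
            for cell in raw
        )
        filled = sum(1 for cell in row if cell is not None)
        if filled < min_values:
            null_count += 1
            continue
        # A duplicate has every attribute value equal to an earlier row.
        if row in seen:
            duplicate_count += 1
            continue
        seen.add(row)
        kept.append(row)
    return kept, null_count, duplicate_count


if __name__ == "__main__":
    sample = [
        ("sunny", "85", "no"),
        ("sunny", "85", "no"),   # duplicate of the first row
        (None, "", None),        # null row
        ("rainy", None, "yes"),  # partially filled, kept when min_values=2
    ]
    kept, nulls, dups = clean_rows(sample, min_values=2)
    print(len(kept), nulls, dups)
```

Both counts are returned so that they can be reported to the user, as the slide describes.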
KD Trees
 K Dimensional Trees
 Space Partitioning Data Structure
 Splitting planes perpendicular to
  Coordinate Axes
 Reduces the average cost of a nearest-neighbour query to
  O(log n) for a balanced tree
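A minimal KD-tree sketch may make the points above concrete. It is a generic textbook construction (median split, cycling through the axes), not the project's implementation; the names `Node`, `build`, `nearest` and `sq_dist` are chosen here for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point                    # point stored at this node
    axis: int                       # coordinate axis the splitting plane is perpendicular to
    left: Optional["Node"] = None   # points with smaller-or-equal coordinate on `axis`
    right: Optional["Node"] = None  # points with larger-or-equal coordinate on `axis`


def build(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Build a balanced KD tree by splitting at the median on each axis in turn."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(
        ordered[mid],
        axis,
        build(ordered[:mid], depth + 1),
        build(ordered[mid + 1:], depth + 1),
    )


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node: Optional[Node], target: Point,
            best: Optional[Point] = None) -> Optional[Point]:
    """Return the stored point closest to `target`."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, target) < sq_dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if diff ** 2 < sq_dist(best, target):
        best = nearest(far, target, best)
    return best


if __name__ == "__main__":
    pts = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
    print(nearest(build(pts), (9.0, 2.0)))  # (8.0, 1.0)
```

The pruning test in `nearest` is what lets a query skip whole subtrees instead of scanning every point.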
Clustering
 Our clustering algorithm uses a KD tree extensively to
  improve its time complexity.
 Our algorithm differs from the existing approach in how
  nearest centers are computed.
 Efficiency is achieved because the data points do not
  vary throughout the computation and, hence, this data
  structure does not need to be recomputed at each
  stage.
K-means Clustering
 Complexity is O( n * K * I * d )
 – n = number of points, K = number of clusters,
 I = number of iterations, d = number of attributes
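As a rough worked example of this bound, take the 17-instance, 3-attribute data set used in the comparison later in the deck. The values K = 2 and I = 10 are assumptions picked for illustration; the slides do not state them.

```latex
n \cdot K \cdot I \cdot d = 17 \cdot 2 \cdot 10 \cdot 3 = 1020
```

That is on the order of a thousand coordinate-level operations, and the count grows linearly in each of the four factors.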
K means
 K-Means is a commonly used clustering technique. In
  this analysis the user starts with a collection of samples and attempts to
  group them into ‘k’ clusters based on a chosen
  distance measure. The main steps of the K-Means
  clustering algorithm are given below.
 1. The algorithm starts by creating ‘k’ different clusters. The given
  sample set is first randomly distributed among these ‘k’
  clusters.
 2. Next, the distance from each
  sample within a given cluster to its respective cluster centroid is
  calculated.
 3. Samples are then moved to the cluster (k′) whose centroid is at the
  shortest distance from the sample.
 As a first step in the cluster analysis, the user decides
  on the number of clusters ‘k’. This parameter can
  take integer values with a lower bound of 1
  (in practice, 2 is the smallest relevant number of
  clusters) and an upper bound equal to the total
  number of samples.

 The K-Means algorithm is repeated a number of times
  to obtain the best clustering solution found, each time
  starting with a random set of initial clusters.
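The K-Means steps and the repeated random restarts described above can be sketched as follows. This is plain K-Means with a random initial partition, not the KD-tree variant of this project; the function names, the `restarts` and `seed` parameters, the stopping rule (no sample changes cluster) and the re-seeding of empty clusters are all assumptions made for the illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_centers(points: Sequence[Point], labels: List[int], k: int,
                   rng: random.Random) -> List[Point]:
    """Compute the centroid of every cluster."""
    centers: List[Point] = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            dims = len(members[0])
            centers.append(tuple(sum(p[i] for p in members) / len(members)
                                 for i in range(dims)))
        else:
            # Assumption: an empty cluster is re-seeded with a random sample.
            centers.append(rng.choice(points))
    return centers


def kmeans_once(points: Sequence[Point], k: int, rng: random.Random,
                max_iter: int = 100) -> Tuple[List[int], float]:
    """One run: random partition, then move samples to the nearest centroid."""
    # Step 1: distribute the samples randomly among the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centers = update_centers(points, labels, k, rng)
    for _ in range(max_iter):
        # Steps 2 and 3: measure distances and move each sample to the closest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        centers = update_centers(points, labels, k, rng)
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, sse


def kmeans(points: Sequence[Point], k: int, restarts: int = 10,
           seed: int = 0) -> Tuple[List[int], float]:
    """Repeat K-Means from random starts and keep the run with the lowest error."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of samples")
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(restarts)),
               key=lambda run: run[1])


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    labels, sse = kmeans(data, k=2)
    print(labels, round(sse, 3))
```

Keeping the run with the lowest sum of squared distances is one common way to pick among restarts; the slides do not say which criterion the tool uses.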
COMPARISON OF OUR TOOL WITH WEKA

  A data set with the following statistics was run on
  both WEKA and our tool:

 Relation = weather
 No. of attributes = 3
 No. of instances (including redundant/duplicate and
  null instances) = 17
Limitations:
This tool does not provide protection from:
 Shared storage failures.

 Network service failures.

 Operational errors.

 Site disasters (unless a geographically dispersed
 clustering solution has been implemented).
In the near future…
 Market analysis
   Marketing strategies
   Advertisement
 Risk analysis and management
   Finance and finance investments
   Manufacturing and production
 Fraud detection and detection of unusual patterns
 (outliers)
   Telecommunication
   Financial transactions
   Anti-terrorism (!!!)
CONCLUSION
   We devised a new algorithm for clustering by considering the following variations:

 MS-Excel files are successfully read, handled and processed by the system with the help
  of the ‘jxl.jar’ library. Using this library also exposed new features and functionality
  for working with Excel documents.

 Null data sets were removed, along with redundant and duplicate data
  sets.

 The algorithm chooses better starting clusters, i.e. the initial values (or “seeds”)
  for the clustering algorithm.

 A filtering algorithm is included, which uses KD trees to speed up each k-means
  step.

 The initial centers are chosen by this algorithm; plain K-Means does not specify how they are
  to be selected.

 An inappropriate choice of the number of clusters can yield poor results, so
  the number of clusters is determined from the data set.
References
 An Efficient k-Means Clustering Algorithm: Analysis and
  Implementation – Tapas Kanungo, Nathan S. Netanyahu,
  Christine D. Piatko, Ruth Silverman, Angela Y. Wu
 Introduction to Clustering Techniques – Leo Wanner
 A Comprehensive Overview of Basic Clustering Algorithms – Glenn Fung
 Introduction to Data Mining – Tan, Steinbach, Kumar
Questions/comments…?
