SlideShare a Scribd company logo
1 of 47
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Introduction   Definitions  Clustering  Discussion Open Problems CONTENTS 
Prof. Dr. Benno Stein  Dr. Sven Meyer zu Eissen  Weimar University,  Germany Ideas,  Materials,  Collaboration - Structuring and Indexing - AI search  in IR  - Semi-Supervised Learning Dr. Xiaojin Zhu University of Wisconsin, Madison ,  USA
TEXTUAL DATA Subject of Grouping NON TEXTUAL DATA Local Terminology It is not important what is the source of data: textual or non textual. Data :  work in the space of numerical   parameters   Texts :  work in the space of  words   Example: typical dialog between  passenger (  US   ) and railway  directory inquires (  DI  )
TEXTS  (‘indexed’) Presentation of Textual Data  Vector model (‘parameterized’) Local Terminology Indexed   texts  are only parameterized texts in the  space of words
TEXTS  (‘indexed’) Presentation of Textual Data  TEXTS (‘parameterized’)  Local Terminology  Indexed   texts  are only parameterized texts in the  space of themes   = >  Category /Context  Vector Models... Example :  manually parameterized dialogs in the  space of parameters  (transport service and passenger needs)
Introduction   Definitions   Clustering  Discussion Open Problems CONTENTS 
Unsupervised Learning Types  of  Grouping Supervised Learning Semi-Supervised Learning We know nothing about data sructure We know well  data sructure We know something about data sructure
Clustering  Classification Characteristics:  Characteristics : Absence of  patterns  or  descriptions   Presence of  patterns  or  descriptiones of classes, so the results  are  of classes, so the results are  defined by the  nature   of  defined by the  user   (  N>=1  ) the data themselves  (  N >1  ) Synonyms:  Synonyms : Classification   without teacher   Classificatio n   with teacher Unsupervised   learning   Supervised   learning Number of clusters   Specials terms :   [  ] is known   exactly   Categorization   (of documents) [x] is known  approximately   Diagnostics   (technics, medicine) [  ]  is  not   known  =>  searching  Recognition  (technics, science) Types  of  Grouping
“ Semi Clustering/Classification”  Classification Characteristics:  Characteristics : Presence of  limited number   Presence of  patterns  or of patterns, so the results  are  descriptiones   of classes, so the results defined both by the  user  or  are defined by the  user   (  N>=1  ) by  the  data  themselves  (  N >1  ) Synonyms:  Synonyms : Semi-Classification   Classificatio n   with teacher Semi Supervised   learning   Supervised   learning Number of clusters/categories  Specials terms :   [  ] is known   exactly   Categorization   (of documents) [x] is known  approximately  Diagnostics   (technics, medicine) [  ]  is  not   known  =>  searching  Recognition  (technics, science) Types  of  Grouping
Objectives of Grouping 1. Organization  (structuring) of an object set  Process is named   data structuring   2. Searching interesting patterns  Process is named   navigation 3. Grouping for other applications: -  Knowledge   discovery (clustering) -  Summarization   of documents   Note:  Do not mix  the   type   of grouping  and its  objective
Classification of methods ,[object Object],[object Object],[object Object],[object Object],[object Object],Based on data presentation Methods oriented on  free metric space Every object is presented as a point in a free space Methods oriented on graphs Every object is presented as an element on graph
Hard grouping   Hard   clustering Hard   categorization Soft grouping Soft   clustering Soft   categorization Example The distribution of letters of Moscovites to the Government is  soft categorization   (numbers in the table reflect the relative weight of each theme)  Fuzzy Grouping
Preprocessing  <= Processing General Scheme of Clustering Process - I Principal idea : To transform texts to  num erical  form in order to use  matematical   tools   Remember :   our  problem is grouping textual  documents but  not  undestanding Here: Both rude and good matrixes are matrix  Object/Attributes
General Scheme of Clustering Process - II Preprocessing  Processing  <= Here: matrix   Attribute/Attribute  can be used instead of matrix  Object/Object
Matrixes to be Considered
Clustering for Categorization ,[object Object],[object Object],Matrix contains the value of word co-occurrences in texts.  Red :  if value more than some threshold. White :  if less.
Clustering for Categorizatión ,[object Object],[object Object],Words are groupped. Cluster = >   Subdictionary Absence of blocks means absence  of  Subthemes
Importance of Preprocessing  ( it takes   60%-90% of efforts)
Introduction   Definitions  Clustering   Discussion Open Problems CONTENTS 
Definitions Def. 1  “Let us V be the set of objects. Clustering  C  = { Ci   |  Ci  є  V   }   of V  is  division  of V  on subsets,  for which we have :  U i Ci = V  and  Ci  ∩ Cj = 0  i  ≠j“ Def. 2  “Let us V be the set of nodes, E be arcs,  φ   is weight function that reflects the distance between objects, so we have a weighted graph  G  = { V,E,  φ  }.   In this case  C   is named as clustering of  G .” In the framework of the second definition every Ci  produced subgraph G(Ci).  Both subsets  Ci  and subgraphs  G(Ci)   are  named  clusters . Graph Set Clique
Definitions Principal note Both  definitions  SAYS NOTHING :  -  about quality of clusters  -  about  numbers of  clusters   Reason of difficulties Nowadays  there is no any general agreement   about any universal defintion of the term  ‘ cluster ’ What means that clustering  is good ? 1. Closeness between  objets  inside clusters  is  essentially more  than  the  closeness  between  clusters   themselves 2. Constructed  clusters correspond  to  intuitive presentations  of users  ( they are  natural   clusters)
Classification  of  methods ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Based on the way of grouping
Hierarchy based methods Neighbors. Every object is cluster ,[object Object],[object Object],[object Object],[object Object],[object Object]
Hierarchy  based  methods Nearest neighbor  method   (NN)
Exemplar based methods K - means,  centroid
Method K-means ,[object Object],[object Object],[object Object],[object Object],Exemplar based methods
Method X-means (Dan Pelleg, Andrew Moor) Approach Using evaluation of object distribution  Selection of the most  likely points Advantage - More rapid - Number of cluster  is not fixed (in all cases it tends to be less)   Exemplar based methods
Density based methods ,[object Object],[object Object],[object Object],[object Object],[object Object],Suboptimal solution Only part  of neighbors are considered on every step  (to save time, to avoid mergence) .
Density based methods ,[object Object],[object Object],[object Object],[object Object],[object Object],Preprocessing for MajorClust Many weak links  can be stronger than the several  strongest ones that disfigures  results.  So: weak links should be  eliminated before clustering
Cluster Validity ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Dunn   index (to be  max )
Cluster Validity ,[object Object],[object Object],[object Object],[object Object],Dunn  index (to be  max ) is too sensible to extremal cases
Cluster Usability ,[object Object],[object Object],[object Object],[object Object],[object Object],Cluster  F -measure  (  Benno Stein  ) Data Expert Method Here:  i, j  are indexes of clusses and clusters C * i   , C j  are classes and clusters  prec(i,j), rec(i,j)  are precision and recall
Validity  and  Usability Conclusion Density expected measure  corresponds to  F -measure   reflecting expert ’s opinion. So,  DEM  can be an indicator of expert  opinion
Tecnologies of  Clustering ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Meta Methods Algorithm (example)  Notations: N   is  the number of objects in a given cluster D   is the diagonal of a given cluster  Initially  N 0   and their   centers   Ci  are   given Steps 1. Method  K - medoid (or any other one) is performed 2. If  N   >  N max  or  D  >  D max   (in any cluster), then this cluster is divided on 2 parts. Go to p.1 3. If  N  <  N min   or  D   <  D min  (in any cluster), then  this and the closest clusters are joined. Go to p.1 4. When the number of iteration  I  >  I max , Stop Otherwise go to p.1
Visual Clustering Clustering on dendrite  Clustering in space of factors
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Authorship References : Labbe C., Labbe D.  Inter-textual distance and authorship attribution Corneille and Molier.  Journ. of Quantitative Linguistics.  2001. Vol.8, N_3, pp.213-331
Clustering Authorship Results  1) 18 comedies of Molier  should be belonged to  Corneille   2) 15 comedies of Mollier  are weak connected with all his other works.  So, they can be written by  two authors   3) 2 comedies of Corneille now are considered as works of  Molier .  etc. Note : During a certain time Molier and Corneille were friends
Special and Universal packages with algorithms of  С lustering 1.  ClustAn  ( Scotland )  www. clustan.com   Clustan Graphics-7 (2006) 2.  MatLab  Descriptions are in Internet   3.  Statistica  Descriptions are in Internet   Learning Journals and  С ongresses about Clustering 1. Journal  “Journal of Classification”,   Springer   2. IFCS  -  International Federation of Classification Societies, Conferences 3. CSNA  -  Classification Society of North America, Seminars, Workshops
Introduction   Definitions  Clustering  Discussion Open Problems CONTENTS 
Certain Observations The numbers of methods for grouping data is a little bit more than the numbers of researchers working in this area.   Problem does not consist in searching the  best method  for all cases.  Problem consists in searching the  method being relevant  for your data.  Only you know what methods are the best for you own data . Principal problems consist in choice of indexes (parameters) and measure of closeness to be adecuate  to a given problem and given data  Frecuently the results are bad  because of the  bad indexes   and   bad measure   but not the  bad   method   !
Certain Observations Antipodal methods To be sure that results are really good and  do not depend on the method  used  one should test these results using any  antipodal  methods Solomon G, 1977: “The most antipodes are:   NN-method   and  K-means ” Sensibility To be sure that results  do not depend  essentially on the  method’s parameters   one should perform the analysis of sensibility by changing parameters  of  adjustment.
Introduction   Definitions  Clustering  Conclusions Open Problems CONTENTS 
Some Problems Question 1  How to reveal  alien   objects? Solution (idea) Revealing  a stable  structure on different sets of objects. They are subsets of a given set.   Object distribution reflects: real structure ( nature )  +  noise ( alien objects )
Some Problems Question 2  How to  accelerate  classification? Solution (idea) Filtering  objects, which give  a minimum contribution to  decisive function  Representative objects of each cluster
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

More Related Content

What's hot

Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learningamalalhait
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysisDataminingTools Inc
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysisguru_prasadg
 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...Simplilearn
 
Hierarchical clustering.pptx
Hierarchical clustering.pptxHierarchical clustering.pptx
Hierarchical clustering.pptxNTUConcepts1
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithmhadifar
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean
 
Principal component analysis and lda
Principal component analysis and ldaPrincipal component analysis and lda
Principal component analysis and ldaSuresh Pokharel
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data miningKrish_ver2
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learningHaris Jamil
 
Principal Component Analysis (PCA) and LDA PPT Slides
Principal Component Analysis (PCA) and LDA PPT SlidesPrincipal Component Analysis (PCA) and LDA PPT Slides
Principal Component Analysis (PCA) and LDA PPT SlidesAbhishekKumar4995
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsPrashanth Guntal
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysisKrish_ver2
 
Grid based method & model based clustering method
Grid based method & model based clustering methodGrid based method & model based clustering method
Grid based method & model based clustering methodrajshreemuthiah
 
Data Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalizationData Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalizationDataminingTools Inc
 

What's hot (20)

Unsupervised learning
Unsupervised learningUnsupervised learning
Unsupervised learning
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysis
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
 
Hierarchical clustering.pptx
Hierarchical clustering.pptxHierarchical clustering.pptx
Hierarchical clustering.pptx
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
 
Clustering
ClusteringClustering
Clustering
 
Pca ppt
Pca pptPca ppt
Pca ppt
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
Principal component analysis and lda
Principal component analysis and ldaPrincipal component analysis and lda
Principal component analysis and lda
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
 
Principal Component Analysis (PCA) and LDA PPT Slides
Principal Component Analysis (PCA) and LDA PPT SlidesPrincipal Component Analysis (PCA) and LDA PPT Slides
Principal Component Analysis (PCA) and LDA PPT Slides
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithms
 
Data Mining: Outlier analysis
Data Mining: Outlier analysisData Mining: Outlier analysis
Data Mining: Outlier analysis
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysis
 
Grid based method & model based clustering method
Grid based method & model based clustering methodGrid based method & model based clustering method
Grid based method & model based clustering method
 
Data Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalizationData Mining: Data cube computation and data generalization
Data Mining: Data cube computation and data generalization
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
 

Similar to Clustering

Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basicHouw Liong The
 
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberHouw Liong The
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10mqasimsheikh5
 
DM UNIT_4 PPT for btech final year students
DM UNIT_4 PPT for btech final year studentsDM UNIT_4 PPT for btech final year students
DM UNIT_4 PPT for btech final year studentssriharipatilin
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasicengrasi
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptSubrata Kumer Paul
 
Paper id 26201478
Paper id 26201478Paper id 26201478
Paper id 26201478IJRAT
 
iiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdfiiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdfVIKASGUPTA127897
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
Chapter 10.1,2,3.pptx
Chapter 10.1,2,3.pptxChapter 10.1,2,3.pptx
Chapter 10.1,2,3.pptxAmy Aung
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptionsrefedey275
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.pptSamPrem3
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.pptPalaniKumarR2
 

Similar to Clustering (20)

Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basic
 
Capter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & KamberCapter10 cluster basic : Han & Kamber
Capter10 cluster basic : Han & Kamber
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10
 
DM UNIT_4 PPT for btech final year students
DM UNIT_4 PPT for btech final year studentsDM UNIT_4 PPT for btech final year students
DM UNIT_4 PPT for btech final year students
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
CLUSTERING
CLUSTERINGCLUSTERING
CLUSTERING
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
Paper id 26201478
Paper id 26201478Paper id 26201478
Paper id 26201478
 
iiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdfiiit delhi unsupervised pdf.pdf
iiit delhi unsupervised pdf.pdf
 
47 292-298
47 292-29847 292-298
47 292-298
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
My8clst
My8clstMy8clst
My8clst
 
Chapter 10.1,2,3.pptx
Chapter 10.1,2,3.pptxChapter 10.1,2,3.pptx
Chapter 10.1,2,3.pptx
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt
 
20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt20IT501_DWDM_PPT_Unit_IV.ppt
20IT501_DWDM_PPT_Unit_IV.ppt
 
ClusetrigBasic.ppt
ClusetrigBasic.pptClusetrigBasic.ppt
ClusetrigBasic.ppt
 

More from NLPseminar

[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо
[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо
[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна ЛандоNLPseminar
 
клышинский
клышинскийклышинский
клышинскийNLPseminar
 
конф ии и ея гаврилова
конф ии и ея  гавриловаконф ии и ея  гаврилова
конф ии и ея гавриловаNLPseminar
 
кудрявцев V3
кудрявцев V3кудрявцев V3
кудрявцев V3NLPseminar
 
акинина осмоловская
акинина осмоловскаяакинина осмоловская
акинина осмоловскаяNLPseminar
 
потапов
потаповпотапов
потаповNLPseminar
 
molchanov(promt)
molchanov(promt)molchanov(promt)
molchanov(promt)NLPseminar
 
белканова
белкановабелканова
белкановаNLPseminar
 
гвоздикин
гвоздикингвоздикин
гвоздикинNLPseminar
 
веселов
веселоввеселов
веселовNLPseminar
 

More from NLPseminar (20)

[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо
[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо
[ИТ-лекторий ФКН ВШЭ]: Диалоговые системы. Татьяна Ландо
 
Events
EventsEvents
Events
 
Tomita
TomitaTomita
Tomita
 
бетин
бетинбетин
бетин
 
Andreev
AndreevAndreev
Andreev
 
клышинский
клышинскийклышинский
клышинский
 
конф ии и ея гаврилова
конф ии и ея  гавриловаконф ии и ея  гаврилова
конф ии и ея гаврилова
 
кудрявцев V3
кудрявцев V3кудрявцев V3
кудрявцев V3
 
rubashkin
rubashkinrubashkin
rubashkin
 
Vlasova
VlasovaVlasova
Vlasova
 
Ageev
AgeevAgeev
Ageev
 
Khomitsevich
Khomitsevich Khomitsevich
Khomitsevich
 
акинина осмоловская
акинина осмоловскаяакинина осмоловская
акинина осмоловская
 
Serebryakov
SerebryakovSerebryakov
Serebryakov
 
потапов
потаповпотапов
потапов
 
molchanov(promt)
molchanov(promt)molchanov(promt)
molchanov(promt)
 
белканова
белкановабелканова
белканова
 
Skatov
SkatovSkatov
Skatov
 
гвоздикин
гвоздикингвоздикин
гвоздикин
 
веселов
веселоввеселов
веселов
 

Recently uploaded

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsKarakKing
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxPooja Bhuva
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxUmeshTimilsina1
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the ClassroomPooky Knightsmith
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - Englishneillewis46
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Pooja Bhuva
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17Celine George
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Amil baba
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 

Recently uploaded (20)

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 

Clustering

  • 1.
  • 2. Introduction Definitions Clustering Discussion Open Problems CONTENTS 
  • 3. Prof. Dr. Benno Stein Dr. Sven Meyer zu Eissen Weimar University, Germany Ideas, Materials, Collaboration - Structuring and Indexing - AI search in IR - Semi-Supervised Learning Dr. Xiaojin Zhu University of Wisconsin, Madison , USA
  • 4. TEXTUAL DATA Subject of Grouping NON TEXTUAL DATA Local Terminology It is not important what is the source of data: textual or non textual. Data : work in the space of numerical parameters Texts : work in the space of words Example: typical dialog between passenger ( US ) and railway directory inquires ( DI )
  • 5. TEXTS (‘indexed’) Presentation of Textual Data Vector model (‘parameterized’) Local Terminology Indexed texts are only parameterized texts in the space of words
  • 6. TEXTS (‘indexed’) Presentation of Textual Data TEXTS (‘parameterized’) Local Terminology Indexed texts are only parameterized texts in the space of themes = > Category /Context Vector Models... Example : manually parameterized dialogs in the space of parameters (transport service and passenger needs)
  • 7. Introduction Definitions Clustering Discussion Open Problems CONTENTS 
  • 8. Unsupervised Learning Types of Grouping Supervised Learning Semi-Supervised Learning We know nothing about data sructure We know well data sructure We know something about data sructure
  • 9. Clustering Classification Characteristics: Characteristics : Absence of patterns or descriptions Presence of patterns or descriptiones of classes, so the results are of classes, so the results are defined by the nature of defined by the user ( N>=1 ) the data themselves ( N >1 ) Synonyms: Synonyms : Classification without teacher Classificatio n with teacher Unsupervised learning Supervised learning Number of clusters Specials terms : [ ] is known exactly Categorization (of documents) [x] is known approximately Diagnostics (technics, medicine) [ ] is not known => searching Recognition (technics, science) Types of Grouping
  • 10. “ Semi Clustering/Classification” Classification Characteristics: Characteristics : Presence of limited number Presence of patterns or of patterns, so the results are descriptiones of classes, so the results defined both by the user or are defined by the user ( N>=1 ) by the data themselves ( N >1 ) Synonyms: Synonyms : Semi-Classification Classificatio n with teacher Semi Supervised learning Supervised learning Number of clusters/categories Specials terms : [ ] is known exactly Categorization (of documents) [x] is known approximately Diagnostics (technics, medicine) [ ] is not known => searching Recognition (technics, science) Types of Grouping
  • 11. Objectives of Grouping 1. Organization (structuring) of an object set Process is named data structuring 2. Searching interesting patterns Process is named navigation 3. Grouping for other applications: - Knowledge discovery (clustering) - Summarization of documents Note: Do not mix the type of grouping and its objective
  • 12.
  • 13. Hard grouping Hard clustering Hard categorization Soft grouping Soft clustering Soft categorization Example The distribution of letters of Moscovites to the Government is soft categorization (numbers in the table reflect the relative weight of each theme) Fuzzy Grouping
  • 14. Preprocessing <= Processing General Scheme of Clustering Process - I Principal idea : To transform texts to num erical form in order to use matematical tools Remember : our problem is grouping textual documents but not undestanding Here: Both rude and good matrixes are matrix Object/Attributes
  • 15. General Scheme of Clustering Process - II Preprocessing Processing <= Here: matrix Attribute/Attribute can be used instead of matrix Object/Object
  • 16. Matrixes to be Considered
  • 17.
  • 18.
  • 19. Importance of Preprocessing ( it takes 60%-90% of efforts)
  • 20. Introduction Definitions Clustering Discussion Open Problems CONTENTS 
  • 21. Definitions Def. 1 “Let us V be the set of objects. Clustering C = { Ci | Ci є V } of V is division of V on subsets, for which we have : U i Ci = V and Ci ∩ Cj = 0 i ≠j“ Def. 2 “Let us V be the set of nodes, E be arcs, φ is weight function that reflects the distance between objects, so we have a weighted graph G = { V,E, φ }. In this case C is named as clustering of G .” In the framework of the second definition every Ci produced subgraph G(Ci). Both subsets Ci and subgraphs G(Ci) are named clusters . Graph Set Clique
  • 22. Definitions Principal note Both definitions SAYS NOTHING : - about quality of clusters - about numbers of clusters Reason of difficulties Nowadays there is no any general agreement about any universal defintion of the term ‘ cluster ’ What means that clustering is good ? 1. Closeness between objets inside clusters is essentially more than the closeness between clusters themselves 2. Constructed clusters correspond to intuitive presentations of users ( they are natural clusters)
  • 23.
  • 24.
  • 25. Hierarchy based methods Nearest neighbor method (NN)
  • 26. Exemplar based methods K - means, centroid
  • 27.
  • 28. Method X-means (Dan Pelleg, Andrew Moor) Approach Using evaluation of object distribution Selection of the most likely points Advantage - More rapid - Number of cluster is not fixed (in all cases it tends to be less) Exemplar based methods
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34. Validity and Usability Conclusion Density expected measure corresponds to F -measure reflecting expert ’s opinion. So, DEM can be an indicator of expert opinion
  • 35.
  • 36. Meta Methods Algorithm (example) Notations: N is the number of objects in a given cluster D is the diagonal of a given cluster Initially N 0 and their centers Ci are given Steps 1. Method K - medoid (or any other one) is performed 2. If N > N max or D > D max (in any cluster), then this cluster is divided on 2 parts. Go to p.1 3. If N < N min or D < D min (in any cluster), then this and the closest clusters are joined. Go to p.1 4. When the number of iteration I > I max , Stop Otherwise go to p.1
  • 37. Visual Clustering Clustering on dendrite Clustering in space of factors
  • 38.
  • 39. Clustering Authorship Results 1) 18 comedies of Molier should be belonged to Corneille 2) 15 comedies of Mollier are weak connected with all his other works. So, they can be written by two authors 3) 2 comedies of Corneille now are considered as works of Molier . etc. Note : During a certain time Molier and Corneille were friends
  • 40. Special and Universal packages with algorithms of С lustering 1. ClustAn ( Scotland ) www. clustan.com Clustan Graphics-7 (2006) 2. MatLab Descriptions are in Internet 3. Statistica Descriptions are in Internet Learning Journals and С ongresses about Clustering 1. Journal “Journal of Classification”, Springer 2. IFCS - International Federation of Classification Societies, Conferences 3. CSNA - Classification Society of North America, Seminars, Workshops
  • 41. Introduction Definitions Clustering Discussion Open Problems CONTENTS 
  • 42. Certain Observations The numbers of methods for grouping data is a little bit more than the numbers of researchers working in this area. Problem does not consist in searching the best method for all cases. Problem consists in searching the method being relevant for your data. Only you know what methods are the best for you own data . Principal problems consist in choice of indexes (parameters) and measure of closeness to be adecuate to a given problem and given data Frecuently the results are bad because of the bad indexes and bad measure but not the bad method !
  • 43. Certain Observations Antipodal methods To be sure that results are really good and do not depend on the method used one should test these results using any antipodal methods Solomon G, 1977: “The most antipodes are: NN-method and K-means ” Sensibility To be sure that results do not depend essentially on the method’s parameters one should perform the analysis of sensibility by changing parameters of adjustment.
  • 44. Introduction Definitions Clustering Conclusions Open Problems CONTENTS 
  • 45. Some Problems Question 1 How to reveal alien objects? Solution (idea) Revealing a stable structure on different sets of objects. They are subsets of a given set. Object distribution reflects: real structure ( nature ) + noise ( alien objects )
  • 46. Some Problems Question 2 How to accelerate classification? Solution (idea) Filtering objects, which give a minimum contribution to decisive function Representative objects of each cluster
  • 47.