
Clustering Methods with R

Slides I used in a tutorial on clustering methods with R at an INDUS research network meeting on the 8th October, 2015 (https://sites.google.com/site/indusnetzwerk/events/tagung-berlin). R codes are available at http://rpubs.com/mrkm_a/ClusteringMethodsWithR

Download necessary files: http://tinyurl.com/ClusteringWithR

Clustering Methods with R

  • 1. Clustering Methods with R Akira Murakami Department of English Language and Applied Linguistics University of Birmingham a.murakami@bham.ac.uk
  • 3. Cluster Analysis • Cluster analysis finds groups in data. • Objects in the same cluster are similar to each other. • Objects in different clusters are dissimilar. • A variety of algorithms have been proposed. • Saying “I ran a cluster analysis” does not mean much. • Used in data mining or as a statistical analysis. • Unsupervised machine learning technique. 3
  • 4. Cluster Analysis in SLA • In SLA, clustering has been applied to identify the typology of learners’ • motivational profiles (Csizér & Dörnyei, 2005), • ability/aptitude profiles (Rysiewicz, 2008), • developmental profiles based on international posture, L2 willingness to communicate, and frequency of communication in L2 (Yashima & Zenuk-Nishide, 2008), • cognitive and achievement profiles based on L1 achievement, intelligence, L2 aptitude, and L2 proficiency (Sparks, Patton, & Ganschow, 2012). 4
  • 5. Similarity Measure • Cluster analysis groups the observations that are “similar”. But how do we measure similarity? • Suppose we are interested in clustering L1 groups according to their accuracy on different linguistic features (i.e., the accuracy profile of L1 groups). • As the measure of accuracy, we use an index that takes a value between 0 and 1, such as the TLU score. 5
  • 6–10. Mathematical Distance [figure: a number line from 0.0 to 1.0] Accuracy values are placed on a one-dimensional scale: the distance between L1 Korean and L1 German is 0.2, and the distance between L1 Korean and L1 Japanese is 0.1.
  • 11. (Dis)Similarity Matrix
                   L1 Korean   L1 German   L1 Japanese
      L1 Korean    0.0
      L1 German    0.2         0.0
      L1 Japanese  0.1         0.3         0.0
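In R, such a matrix is produced by dist(). A minimal sketch, assuming raw accuracy values chosen here to reproduce the pairwise distances above (the slides give only the distances, not the raw values):

```r
# One-dimensional accuracy values (assumed) whose pairwise absolute
# differences reproduce the dissimilarity matrix above.
acc <- matrix(c(0.6, 0.8, 0.5), ncol = 1,
              dimnames = list(c("L1 Korean", "L1 German", "L1 Japanese"),
                              "accuracy"))
dist(acc)  # lower-triangular dissimilarity matrix
```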
  • 12. Distance Measures • Things are simple in 1D, but get more complicated in 2D or above. • Different measures of distance • Euclidean distance • Manhattan distance • Maximum distance • Mahalanobis distance • Hamming distance • etc 12
  • 14–20. Euclidean Distance [figure: Article Accuracy against Past tense -ed Accuracy, both on a 0–1 scale] With L1 German at (0.8, 0.6), L1 Korean at (0.4, 0.8), and L1 Japanese at (0.6, 0.5), the Euclidean distance between L1 Korean and L1 German is √((0.4−0.8)² + (0.8−0.6)²) ≈ 0.45; Korean–Japanese ≈ 0.36 and German–Japanese ≈ 0.22.
  • 21. (Dis)Similarity Matrix
                   L1 Korean   L1 German   L1 Japanese
      L1 Korean    0.00
      L1 German    0.45        0.00
      L1 Japanese  0.36        0.22        0.00
  • 22–23. Euclidean Distance (3D) [figure: a 3D plot of Article, Past tense -ed, and Plural -s Accuracy] With L1 German at (0.3, 0.6, 0.9), L1 Korean at (0.6, 0.9, 0.6), and L1 Japanese at (0.9, 0.4, 0.5), the distances are Korean–German = 0.52, Korean–Japanese = 0.59, and German–Japanese = 0.75.
  • 24. (Dis)Similarity Matrix
                   L1 Korean   L1 German   L1 Japanese
      L1 Korean    0.00
      L1 German    0.52        0.00
      L1 Japanese  0.59        0.75        0.00
  • 25. Distance Measures • Things are simple in 1D, but get more complicated in 2D or above. • Different measures of distance • Euclidean distance • Manhattan distance • Maximum distance • Mahalanobis distance • Hamming distance • etc 25
  • 26–29. Manhattan Distance [figure: the same axes] Between L1 Korean (0.4, 0.8) and L1 German (0.8, 0.6), the differences along the two axes are 0.4 and 0.2, so the Manhattan distance = 0.4 + 0.2 = 0.6.
  • 30–32. [figure: three points at (0.1, 0.4), (0.9, 0.3), and (0.6, 0.9)] Between (0.1, 0.4) and (0.6, 0.9), the Euclidean distance is 0.71 and the Manhattan distance is 0.5 + 0.5 = 1.00; between (0.1, 0.4) and (0.9, 0.3), the Euclidean distance is 0.81 and the Manhattan distance is 0.8 + 0.1 = 0.90. Note that the two measures rank these pairs differently.
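Both measures are available through dist()'s method argument; a quick check with the three points from the comparison (the row labels A–C are added here for readability):

```r
# The three points from the Euclidean/Manhattan comparison.
pts <- rbind(A = c(0.1, 0.4), B = c(0.9, 0.3), C = c(0.6, 0.9))
dist(pts, method = "euclidean")  # straight-line distances
dist(pts, method = "manhattan")  # sums of absolute coordinate differences
```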
  • 33. dist() • In R, the dist() function is used to obtain dissimilarity matrices. • Practicals 33
  • 34. Clustering Methods • Now that we know the concept of similarity, we move on to the clustering of objects based on the similarity. • A number of methods have been proposed for clustering. We will look at the following two: • agglomerative hierarchical cluster analysis • k-means 34
  • 36. Agglomerative Hierarchical Cluster Analysis • In agglomerative hierarchical clustering, observations are clustered in a bottom-up manner. 1. Each observation forms an independent cluster at the beginning. 2. The two clusters that are most similar are merged. 3. Step 2 is repeated until all the observations are merged into a single cluster. 36
  • 37. Linkage Criteria • How do we calculate the similarity between clusters that each include multiple observations? • Ward’s criterion (Ward’s method) • complete-linkage • single-linkage • etc. 37
  • 39. Ward’s Method • Ward’s method aims for the smallest within-cluster variance. • At each iteration, the two clusters whose merger yields the smallest increase in the sum of squared errors are merged. • Sum of Squared Errors (SSE): the sum of the squared differences between the mean of the cluster and the individual data points. 39
  • 40–45. Ward’s Method [figure: five points, 1 (0.4, 0.2), 2 (0.2, 0.4), 3 (0.4, 0.8), 4 (0.8, 0.8), 5 (0.9, 0.4)] Consider merging points 2 and 3. Their mean is (0.3, 0.6); each point lies 0.22 from the mean, so the squared distances are 0.05 each and the SSE of the merged cluster is 0.05 + 0.05 = 0.10.
  • 46. • This procedure is repeated for all of the pairs. 46
  • 47–51. [figure: the same five points] Suppose points 1 and 2 (mean (0.3, 0.3)) and points 3 and 4 (mean (0.6, 0.8)) have already been merged. The squared distance to the cluster mean is (0.1² + 0.1²) = 0.02 for each of points 1 and 2, and 0.2² = 0.04 for each of points 3 and 4, so the total SSE = 0.02 + 0.02 + 0.04 + 0.04 = 0.12.
  • 52–53. [figure: the same five points] Merging these two clusters into a single cluster with mean (0.45, 0.55) gives squared distances of 0.125, 0.085, 0.065, and 0.185 from the new mean, so SSE = 0.125 + 0.085 + 0.065 + 0.185 = 0.46.
  • 54. ΔSSE • SSE before the merger: 0.12 • SSE after the merger: 0.46 • Difference (ΔSSE): 0.46 - 0.12 = 0.34 54
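The ΔSSE computation can be verified in a few lines of R with the five example points:

```r
# SSE of a cluster: sum of squared deviations from the cluster mean.
pts <- rbind(c(0.4, 0.2), c(0.2, 0.4), c(0.4, 0.8), c(0.8, 0.8), c(0.9, 0.4))
sse <- function(m) sum(sweep(m, 2, colMeans(m))^2)

before <- sse(pts[1:2, ]) + sse(pts[3:4, ])  # clusters {1, 2} and {3, 4}
after  <- sse(pts[1:4, ])                    # after merging them
c(before = before, after = after, delta = after - before)  # 0.12, 0.46, 0.34
```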
  • 55. Ward’s Method [figure: the five points with the two cluster means marked ×]
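The whole agglomerative procedure is available in base R through hclust(). A minimal sketch with the five example points; hclust's "ward.D2" method implements Ward's criterion on Euclidean distances:

```r
pts <- rbind(`1` = c(0.4, 0.2), `2` = c(0.2, 0.4), `3` = c(0.4, 0.8),
             `4` = c(0.8, 0.8), `5` = c(0.9, 0.4))
hc <- hclust(dist(pts), method = "ward.D2")  # Ward's criterion
plot(hc)           # dendrogram of the merge history
cutree(hc, k = 2)  # cut the tree into two clusters
```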
  • 58. EF-Cambridge Open Language Database (EFCAMDAT) 58 • Writings submitted at Englishtown, the online school of Education First • 16 Levels × 8 Units (A1–C2 in CEFR) • Each student submits one writing per unit • Teachers’ feedback available on some writings (≈ error tags) • Available at http://corpus.mml.cam.ac.uk/efcamdat/
  • 59. Linkage Criteria • How do we measure the similarity between clusters that each include multiple observations? • Ward’s criterion (Ward’s method) • complete-linkage • single-linkage • etc. 59
  • 60–62. Complete Linkage [figure: the five points] The distance between two clusters is the largest pairwise distance between their members; in the figure, 0.7.
  • 63. Single Linkage [figure: the five points] The distance between two clusters is the smallest pairwise distance between their members; in the figure, 0.4.
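Both criteria are selected through hclust()'s method argument; a quick check on the five example points:

```r
pts <- rbind(`1` = c(0.4, 0.2), `2` = c(0.2, 0.4), `3` = c(0.4, 0.8),
             `4` = c(0.8, 0.8), `5` = c(0.9, 0.4))
d <- dist(pts)
# Complete linkage: merge height = largest pairwise distance between clusters.
hc_complete <- hclust(d, method = "complete")
# Single linkage: merge height = smallest pairwise distance between clusters.
hc_single <- hclust(d, method = "single")
```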
  • 64. Potential Pitfall of Hierarchical Clustering • It assumes a hierarchical structure in the clusters. • Let us say that our data include two L1 groups over three proficiency levels. • If we group the data into two clusters, the best split may be between the two L1 groups. • If we group them into three clusters, the best split may be by proficiency level. • In this case, the three-cluster solution is not nested within the two-cluster solution, and hierarchical clustering may fail to identify the two clusters. 64
  • 66. k-means Clustering • K-means clustering does not assume a hierarchical structure of clusters. • i.e., no parent/child clusters • Analysts need to specify the number of clusters. 66
  • 67–72. k-means Clustering [figure: the five points with two centroids marked ×] Two centroids are placed, each point is assigned to the nearer centroid (the distances are shown on the slides), each centroid is then moved to the mean of the points assigned to it, and the assignment and update steps are repeated until cluster membership no longer changes.
  • 73. k-Means Clustering • The optimal number of clusters depends on the intended use. • There is no inherently “correct” or “wrong” number of clusters. • The problem is NP-hard, so the algorithm only approximates the solution. • Randomness is involved in the solution: you may get a different solution each time you run it. • It assumes convex clusters. 73
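In R, kmeans() implements this; the nstart argument reruns the algorithm from several random starts, which tames the randomness noted above:

```r
pts <- rbind(c(0.4, 0.2), c(0.2, 0.4), c(0.4, 0.8), c(0.8, 0.8), c(0.9, 0.4))
set.seed(1)                                  # for reproducibility
km <- kmeans(pts, centers = 2, nstart = 25)  # 25 random restarts
km$cluster  # cluster membership of each point
km$centers  # final centroids
```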
  • 74–75. Concave [figure: points arranged in interlocking concave shapes, a structure that convex clusters cannot capture]
  • 77. Within-Learner Centering • The mean accuracy value of each learner was subtracted from all the data points of the learner. • For example, let's suppose the mean sentence length (MSL) of Learner A over 10 writings was • {4.0, 4.2, 4.4, 4.6, 4.8, 5.0, 5.2, 5.4, 5.6, 5.8} and that of Learner B was • {8.0, 8.2, 8.4, 8.6, 8.8, 9.0, 9.2, 9.4, 9.6, 9.8} • The difference in MSL is identical in the two learners (+0.2 per writing). • But the absolute MSL is very different. 77
  • 78. Within-Learner Centering • The mean value of Learner A (4.9) is subtracted from all the data points of Learner A: • → {-0.90, -0.70, -0.50, -0.30, -0.10, 0.10, 0.30, 0.50, 0.70, 0.90}. • Similarly, the mean value of Learner B (8.90) is subtracted from all the data points of Learner B: • → {-0.90, -0.70, -0.50, -0.30, -0.10, 0.10, 0.30, 0.50, 0.70, 0.90}. • It is guaranteed that these two learners are clustered into the same group as they have exactly the same set of values. 78
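The centering step is a one-liner per learner; a sketch with the two hypothetical learners above:

```r
msl_A <- seq(4.0, 5.8, by = 0.2)  # Learner A's MSL per writing
msl_B <- seq(8.0, 9.8, by = 0.2)  # Learner B: same trend, higher absolute level

centered_A <- msl_A - mean(msl_A)  # subtract Learner A's mean (4.9)
centered_B <- msl_B - mean(msl_B)  # subtract Learner B's mean (8.9)
all.equal(centered_A, centered_B)  # TRUE: identical after centering
```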
  • 80. Cluster Validation/Evaluation • We got clusters and explored them, but how do we know how good the clusters are, or whether they indeed capture signal and not just noise? • Are the clusters ‘real’? • Did the earlier clustering capture differences in the true learning curves, or just random noise? 80
  • 81. Two Types of Validation • External Validation • Internal Validation 81
  • 82. External Validation • If there is a systematic pattern between clusters and some external criterion, such as the proficiency or L1 of learners, then what the cluster analysis captured is unlikely to be just noise. 82
  • 83. Internal Validation • Measures of goodness of clusters • silhouette width • Davies–Bouldin index • Dunn index • etc. 83
  • 85. Silhouette Width • Intuitively, the silhouette value is large if within- cluster dissimilarity is small (i.e., learners within each cluster have similar developmental trajectories) and between-cluster dissimilarity is large (i.e., learners in different clusters have different learning curves). • The silhouette is given to each data point (i.e., learner), and all the silhouette values are averaged to measure the cluster distinctiveness of a cluster analysis. 85
  • 86. • Let’s say there are three clusters, A through C. • Let’s further say that i is a member of Cluster A. • Let a(i) be the average distance between that learner and all the other learners that belong to the same cluster. • We also calculate the average distances 1. between the learner and all the other learners that belong to Cluster B 2. between the learner and all the other learners that belong to Cluster C • Let b(i) be the smaller of the two above (1-2). • s(i) = (b(i) - a(i)) / max(a(i), b(i)) 86 Silhouette Width
  • 87–91. Silhouette Width [figure: three clusters of points, with one point i highlighted] The average distance from i to the other members of its own cluster is a(i) = 0.022; the average distances from i to the members of the other two clusters are 0.191 and 0.240.
  • 92. Silhouette Width • a(i) = 0.022 • b(i) = 0.191 (the smaller of the other two) • s(i) = (b(i) - a(i)) / max(a(i), b(i)) • s(i) = (0.191 - 0.022) / 0.191 = 0.882 • This is repeated for all the data points. • Goodness of clustering: mean silhouette width across all the data points. 92
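The computation can be written out directly in base R (the cluster package's silhouette() does the same). A sketch using the five example points and an assumed two-cluster assignment matching the k-means solution:

```r
pts <- rbind(c(0.4, 0.2), c(0.2, 0.4), c(0.4, 0.8), c(0.8, 0.8), c(0.9, 0.4))
cl  <- c(1, 1, 2, 2, 2)  # assumed two-cluster assignment
d   <- as.matrix(dist(pts))

sil <- sapply(seq_len(nrow(pts)), function(i) {
  own <- setdiff(which(cl == cl[i]), i)
  a <- mean(d[i, own])  # a(i): mean distance within own cluster
  b <- min(sapply(setdiff(unique(cl), cl[i]),
                  function(k) mean(d[i, cl == k])))  # b(i): nearest other cluster
  (b - a) / max(a, b)
})
mean(sil)  # mean silhouette width of the clustering
```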
  • 93. Bootstrapping • Now that we have a measure of how good our clustering is, the next question is whether it is good enough to be considered non-random. • We can address this question through a technique called bootstrapping. • The idea is similar to the usual hypothesis-testing procedure. • We obtain the null distribution of the silhouette value and see where our value falls. 93
  • 94. • The specific procedure is as follows: 1. For each learner, we sample 30 writings (with replacement). 2. We run a k-means cluster analysis on the data obtained in Step 1 and calculate the mean silhouette value. 3. Steps 1 and 2 are repeated many times (e.g., 10,000), resulting in 10,000 mean silhouette values, which we treat as the null distribution. 4. We examine whether the 95% range of the distribution in Step 3 includes our observed mean silhouette value. 94 Bootstrapping
  • 95. • The idea here is that we effectively randomize the order of the writings within individual learners and follow the same procedure as our main analysis. • Since the order of writings is random, there should not be any systematic pattern of development. • The clusters obtained in this manner thus capture noise alone. We calculate the mean silhouette value on the noise-only, random clusters, and obtain its distribution by repeating the whole procedure a large number of times. 95 Bootstrapping
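A sketch of this bootstrap in R. The data are simulated here (in the actual analysis each row would be one learner's series of writing scores), a small number of iterations is used for speed, and mean_sil() is a hypothetical helper combining the k-means and silhouette steps above:

```r
set.seed(1)
n_learners <- 40; n_writings <- 30
scores <- matrix(rnorm(n_learners * n_writings), nrow = n_learners)  # simulated

# Mean silhouette width of a k-means solution (hypothetical helper).
mean_sil <- function(m, k = 2) {
  cl <- kmeans(m, centers = k, nstart = 10)$cluster
  d  <- as.matrix(dist(m))
  mean(sapply(seq_len(nrow(m)), function(i) {
    own <- setdiff(which(cl == cl[i]), i)
    a <- if (length(own)) mean(d[i, own]) else 0  # guard against singletons
    b <- min(sapply(setdiff(unique(cl), cl[i]), function(g) mean(d[i, cl == g])))
    (b - a) / max(a, b)
  }))
}

observed <- mean_sil(scores)

# Null distribution: resample each learner's writings with replacement,
# destroying any systematic developmental pattern, then recluster.
null_sil <- replicate(200, {
  resampled <- t(apply(scores, 1, sample, replace = TRUE))
  mean_sil(resampled)
})
quantile(null_sil, c(0.025, 0.975))  # does `observed` fall inside this range?
```

With real learner data, an observed mean silhouette above the 97.5% quantile of the null distribution would suggest the clusters capture more than noise.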