- 1. UNSUPERVISED MACHINE LEARNING VITOMIR KOVANOVIĆ UNIVERSITY OF SOUTH AUSTRALIA #vkovanovic vitomir.kovanovic.info Vitomir.Kovanovic@unisa.edu.au LEARNING ANALYTICS SUMMER INSTITUTE TEACHERS COLLEGE, COLUMBIA UNIVERSITY JUNE 11-13, 2018 SREĆKO JOKSIMOVIĆ UNIVERSITY OF SOUTH AUSTRALIA #s_joksimovic www.sjoksimovic.info Srecko.Joksimovic@unisa.edu.au 1
- 2. About me • Learning analytics researcher • Research Fellow, School of Education, UniSA Data Scientist, Teaching Innovation Unit, UniSA Member of the Centre for Change and Complexity in Learning (C3L) • Member of the SoLAR executive board • Computer science and information systems background • Used cluster analysis in several research projects 2
- 3. About you • Introduce yourself • Name, affiliation, position • Experience with machine learning and clustering • Experience with Weka or some other ML/DM toolkit • Ideas for clustering in your own research/work 3
- 4. Download data from Dropbox or USB http://bit.ly/lak18ul 4
- 5. Workshop outline 1. Three days, four sessions 2. Equally theoretical and practical 3. Use of Weka Machine Learning toolkit 4. Focus on practical use 5. Examples of clustering use in learning analytics 5
- 6. Workshop topics • Introduction to machine learning & unsupervised methods • Introduction to cluster analysis • Overview of cluster analysis use in Learning Analytics • Introduction to WEKA toolkit • Overview of the tutorial dataset • K-means algorithm • K-means demo • Hierarchical clustering algorithms • Hierarchical clustering demo 6
- 7. Tutorial topics • How to choose the number of clusters • How to interpret clustering results • Practical challenges • More advanced cluster analysis approaches • Statistical methods for comparing clusters • Clustering real-world data from OU UK • Discussing different cluster analysis methods 7
- 9. What is machine learning? 9
- 10. What is machine learning? Computing method for making sense of the data 10
- 11. Data is everywhere Each minute: ● 3,600,000 Google searches ● 456,000 Twitter posts ● 46,740 Instagram photos ● 45,787 Uber trips ● 600 new Wikipedia edits ● 13 new Spotify songs Domo (2017). “Data Never Sleeps 5.0” https://www.domo.com 11
- 12. What products should go on sale? Grouping of related products 12
- 13. What movies to recommend? Grouping users based on their viewing preferences 13
- 14. How to navigate streets? Processing multiple streams of information in real time 14
- 15. Online store product recommendation 15
- 16. Fields that influenced machine learning • Statistics • Operations research • Artificial intelligence • Data visualisation • Software engineering • Information systems management 16
- 17. How does machine learning work? WILL IT TAKE MY JOB? 17
- 18. Two key ideas in machine learning 1.Features 2.Models 18
- 19. What is a feature? 1. A feature is a characteristic of a data point 2. Each data point is represented as a vector of features [f1, f2, f3, ..., fm] 3. A whole dataset of N data points is represented as an N × M matrix (rows: data points 1...N; columns: features 1...M) 19
- 20. What is a feature? • The performance of machine learning algorithms depends in large part on the quality of the extracted features (how useful they are for a given ML task) • Expertise and prior knowledge come into play when deciding which features to extract 20
- 21. What is a model? • Something that captures important patterns in the data • A model can be used to • Draw inferences • Understand the data • Learn hidden rules • Support decision making 21
- 22. An example model: BMI calculator • Goal: predicting a person's body fat category (overweight, normal, or underweight) from height (in m) and weight (in kg). • Model: • BMI = weight / height² • If BMI > 25: overweight • If BMI < 18.5: underweight • Otherwise: normal • An example: 1.75 m and 70 kg: BMI = 70 / (1.75 × 1.75) = 22.86 → normal category 22
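The model above can be transcribed directly into a few lines of Python:

```python
def bmi_category(height_m, weight_kg):
    """BMI model from the slide: classify body fat category."""
    bmi = weight_kg / height_m ** 2
    if bmi > 25:
        return "overweight"
    if bmi < 18.5:
        return "underweight"
    return "normal"

# Worked example from the slide: 1.75 m and 70 kg
print(bmi_category(1.75, 70))  # normal (BMI ~ 22.86)
```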
- 23. How does machine learning work? [Diagram] Model development (slow and hard): N data points → feature extraction → N×M feature matrix → model building → ML model. Model use (fast and easy): a new data point → feature extraction → feature vector of length M → ML model → response (prediction) 23
- 24. Two types of errors • Bias: The error from erroneous assumptions of the model. • High bias: miss the relevant relationships between variables (underfitting). • Variance: The error from sensitivity to small fluctuations in the data. • High variance: modelling the random noise in the data, rather than real relationships (overfitting). 24
- 25. Two types of errors • We always work with samples • Samples always contain noise • The trick is to develop models that do not merely fit the training data, but generalize to new, future data 25
- 26. Two types of errors 26 High bias High variance
- 27. The trick is to find optimal model complexity 27
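To make the trade-off concrete, here is a small illustrative sketch (not from the slides; numpy and synthetic data assumed): polynomials of increasing degree are fit to 20 noisy samples of a sine curve, and the error is measured on clean "future" data. Degree 1 underfits (high bias); very high degrees start modelling the noise (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)  # noisy training sample

x_test = np.linspace(0, 1, 200)                # "future" data, without noise
y_test = np.sin(2 * np.pi * x_test)

test_mse = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)          # fit on the training sample only
    test_mse[degree] = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(f"degree {degree}: test MSE {test_mse[degree]:.3f}")
```

With these settings, degree 1 cannot follow the curve at all, while high degrees chase the noise; the sweet spot lies in between.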
- 28. Key machine learning approaches 1. Supervised machine learning 1. Predicting categorical value: Classification 2. Predicting continuous value: Regression 2. Unsupervised machine learning 1. Grouping data points (rows): Cluster analysis 2. Grouping features (columns): Principal Component Analysis (PCA), Factor Analysis (FA), Latent semantic analysis (LSA), Singular Value Decomposition (SVD) 28
- 29. Many more approaches • Models that blur the division between supervised and unsupervised • Reinforcement learning: learning the class label after making a prediction • Neural networks (can be supervised and unsupervised) • Online learning models: learning as data arrives • Feature processing methods: association rule mining 29
- 30. Supervised learning example NOT A GRAD SCHOOL 30
- 31. 10 data points Data point 1 Data point 2 Data point 3 Data point 4 Data point 5 Data point 6 Data point 7 Data point 8 Data point 9 Data point 10 31
- 32. How does machine learning work? [Diagram] Model development (slow and hard): N data points → feature extraction → N×M feature matrix → model building → ML model. Model use (fast and easy): a new data point → feature extraction → feature vector of length M → ML model → response (prediction) 32
- 33. First step: feature extraction • From each data point we extracted four features: • Number of wheels • Colour • Top speed (in km/h) • Weight (in kg) • Our feature matrix is 10 × 4

  ID  Wheels  Color   Top speed (km/h)  Weight (kg)
  1   4       Yellow  220               1,200
  2   4       Red     180               950
  3   2       Blue    260               230
  4   2       Red     210               320
  5   4       Yellow  160               870
  6   4       Blue    170               750
  7   4       Red     190               850
  8   2       Yellow  140               140
  9   2       Yellow  210               310
  10  2       Red     240               280

  33
- 34. Supervised learning: classification • Each data point is provided with a categorical class label (outcome variable) • The goal is to predict the class label for a new data point

  ID  Wheels  Color   Top speed (km/h)  Weight (kg)  Label
  1   4       Yellow  220               1,200        Car
  2   4       Red     180               950          Car
  3   2       Blue    260               230          Bike
  4   2       Red     210               320          Bike
  5   4       Yellow  160               870          Car
  6   4       Blue    170               750          Car
  7   4       Red     190               850          Car
  8   2       Yellow  140               140          Bike
  9   2       Yellow  210               310          Bike
  10  2       Red     240               280          Bike

  New data point [4, Yellow, 260, 1100] → predicted label: Car

  We learned a model to classify a new (unseen) vehicle as either a car or a bike 34
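As an illustration of the workflow (not part of the original slides; scikit-learn assumed, with colours integer-encoded for simplicity), the toy vehicle table can be used to train a classifier:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy vehicle dataset from the slide: (wheels, colour, top speed, weight, label)
colours = {"Yellow": 0, "Red": 1, "Blue": 2}   # simple integer encoding
rows = [
    (4, "Yellow", 220, 1200, "Car"), (4, "Red", 180, 950, "Car"),
    (2, "Blue", 260, 230, "Bike"),   (2, "Red", 210, 320, "Bike"),
    (4, "Yellow", 160, 870, "Car"),  (4, "Blue", 170, 750, "Car"),
    (4, "Red", 190, 850, "Car"),     (2, "Yellow", 140, 140, "Bike"),
    (2, "Yellow", 210, 310, "Bike"), (2, "Red", 240, 280, "Bike"),
]
X = [[w, colours[c], s, kg] for w, c, s, kg, _ in rows]
y = [label for *_, label in rows]

model = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = model.predict([[4, colours["Yellow"], 260, 1100]])[0]
print(pred)  # Car
```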
- 35. Supervised learning: regression • Each data point is provided with a continuous numerical label (outcome variable) • The goal is to predict the outcome value for a new data point

  ID  Wheels  Color   Top speed (km/h)  Weight (kg)  Label
  1   4       Yellow  220               1,200        120,000
  2   4       Red     180               950          40,000
  3   2       Blue    260               230          63,000
  4   2       Red     210               320          53,000
  5   4       Yellow  160               870          21,000
  6   4       Blue    170               750          37,000
  7   4       Red     190               850          21,000
  8   2       Yellow  140               140          26,000
  9   2       Yellow  210               310          68,000
  10  2       Red     240               280          75,000

  New data point [4, Yellow, 260, 1100] → predicted value: 140,000

  We learned a model to predict the price of a new (unseen) vehicle 35
- 36. Unsupervised learning example SAME BUT DIFFERENT 36
- 37. Unsupervised learning: clustering • We want the algorithm to group data points into several groups based on their similarity

  ID  Wheels  Color   Top speed (km/h)  Weight (kg)  Group
  1   4       Yellow  220               1,200        1
  2   4       Red     180               950          1
  3   2       Blue    260               230          2
  4   2       Red     210               320          2
  5   4       Yellow  160               870          1
  6   4       Blue    170               750          1
  7   4       Red     190               850          1
  8   2       Yellow  140               140          2
  9   2       Yellow  210               310          2
  10  2       Red     240               280          2

  New data point [4, Yellow, 260, 1100] → predicted group: 1

  Interpretation of group meaning is up to the researcher (1 = ?, 2 = ?) 37
- 38. Unsupervised learning: clustering • We want the algorithm to group data points into several groups based on their similarity

  ID  Wheels  Color   Top speed (km/h)  Weight (kg)  Group
  1   4       Yellow  220               1,200        2
  2   4       Red     180               950          1
  3   2       Blue    260               230          2
  4   2       Red     210               320          2
  5   4       Yellow  160               870          1
  6   4       Blue    170               750          1
  7   4       Red     190               850          1
  8   2       Yellow  140               140          1
  9   2       Yellow  210               310          2
  10  2       Red     240               280          2

  New data point [4, Yellow, 260, 1100] → predicted group: 2

  Pick the grouping of the data that is most useful for your own purpose 38
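A hedged sketch of the same clustering task with scikit-learn's KMeans (assumed available; the categorical colour feature is dropped and the numeric features are standardized so that weight does not dominate the Euclidean distances):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Numeric vehicle features from the slide: [wheels, top speed (km/h), weight (kg)]
X = [[4, 220, 1200], [4, 180, 950], [2, 260, 230], [2, 210, 320],
     [4, 160, 870],  [4, 170, 750], [4, 190, 850], [2, 140, 140],
     [2, 210, 310],  [2, 240, 280]]

# Standardize so the large weight values do not dominate the distances
X_std = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
print(labels)  # cars and bikes should land in two different groups
```

Note that the cluster ids themselves (0/1) are arbitrary; only the grouping matters.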
- 39. Introduction to cluster analysis WE NEED TO START SOMEWHERE 39
- 40. What is Cluster Analysis? Unsupervised categorization: no predefined classes [Figure: clusters illustrating within-cluster homogeneity and between-cluster distance] 40
- 41. History • Anthropology • Biology • Computer Science • Statistics • Mathematics • Medicine • Psychology • Engineering 41
- 42. Primary goals Gain insights Categorize Compress 42
- 43. Social Sciences • Improve understanding of a domain • Compress and summarize large datasets • Within Learning Analytics: • Profile learners based on their course engagement • Discover emerging topics in a corpus (student discussions, course materials) • Group courses based on their characteristics 43
- 45. Medicine and genetics Clustering patients, symptoms, gene expressions 45
- 46. Marketing and customer analysis Understanding customer populations 46
- 47. Newspapers and document analysis Grouping related news articles and summarizing large collections of documents 47
- 48. Earthquake prediction Identifying sources of earthquakes 48
- 49. Urban planning Grouping buildings based on their properties 49
- 50. What is not clustering? • Simple data partitioning • Single property • Predefined groups • Data clustering • Multiple properties • Unforeseen groups • Combinations of properties describe groups 50
- 51. Important concepts CLUSTERING IS TRICKY BUSINESS 51
- 52. Cluster ambiguity • How many clusters? 52
- 53. Cluster ambiguity • How many clusters? Two-cluster solution 53
- 54. Cluster ambiguity • How many clusters? Four-cluster solution 54
- 55. Cluster ambiguity • How many clusters? Six-cluster solution 55
- 56. Which one? 56
- 57. Cluster separation and stability 57
- 58. Representing a cluster • Centroid – a geometrical centre of a cluster • Medoid – data point closest to the centroid 58
- 59. What is meant by similar? • What is meant by "similar data points"? • Geometry: more similar data points are closer to each other in the N-dimensional feature space • Yes, but: • Closeness to the cluster "centre"? • Closeness to any other data point in the cluster? • Is it about the distance between data points or their spatial density? 59
- 60. Any data point or centre 60
- 61. Types of clustering approaches THERE ARE CLUSTERS OF CLUSTERING METHODS 61
- 62. Different types of clustering methods • Membership strictness • Hard clustering • Each object either belongs to a cluster or not • Soft (fuzzy) clustering • Each object belongs to each cluster to some degree 62
- 63. Different types of clustering methods • Membership exclusivity • Strict partitioning clustering (e.g., K-means) • Each object belongs to one and only one cluster • Strict partitioning clustering with outliers • Each object belongs to zero or one cluster 63
- 64. Different types of clustering methods • Overlapping clustering • Each object can belong to one or more “hard” clusters • Hierarchical clustering • Objects belonging to a child cluster also belong to the parent cluster 64
- 65. Different types of clustering methods • Distance-based clustering • Group objects based on distance among them • Density-based clustering • Group objects based on area they occupy 65
- 66. Special clustering approaches MAAANY more approaches • Model-based clustering: • EM clustering • Neural network approaches – Self-organising maps • Grid-based approaches (e.g., STING) • Clustering algorithms for large datasets • Clustering of stream data in real time • Clustering (partitioning) approaches for different types of data (e.g., graphs) • Clustering approaches for categorical data • Clustering approaches for freeform clusters (e.g., CURE) • Clustering approaches for high-dimensional data (e.g., CLIQUE, PROCLUS) • Constraint-based clustering • Semi-supervised clustering 66
- 67. Multivariate methods • N data points have M features • Find K clusters so that • Each data point is associated with each of the K clusters to a certain degree (0: none, 1.0: fully) • Each of the K clusters is associated with all M features to a certain degree • Find the K which maximizes the likelihood of the observed data 67
- 68. Neural network approaches • A network of connected nodes that propagate signals • Edges have coefficients that alter signal propagation • Traditionally a supervised learning method • Backpropagation is the method for learning the coefficients • The learning method and network structure can be altered to support unsupervised learning • Nodes can move! • Eventually, the positions of the nodes indicate the locations of clusters 68
- 69. Graph partitioning • Partitioning network into subgraphs • Goal to have highly dense subgraphs with few connections between them 69
- 70. Popular distance metrics • A way of calculating similarity between different data points • Important for methods based on distances (e.g., K-means, hierarchical clustering) • Have a significant effect on the final clustering results

  Euclidean distance: $\|a - b\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$
  Squared Euclidean distance: $\|a - b\|_2^2 = \sum_i (a_i - b_i)^2$
  Manhattan (city-block) distance: $\|a - b\|_1 = \sum_i |a_i - b_i|$
  Maximum (Chebyshev) distance: $\|a - b\|_\infty = \max_i |a_i - b_i|$

  70
- 71. Distance metrics example (legs of length 4 and 3) • Euclidean: $\sqrt{4^2 + 3^2} = 5$ • Squared Euclidean: $4^2 + 3^2 = 25$ • Manhattan: $4 + 3 = 7$ • Maximum: $\max(4, 3) = 4$ 71
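The same 3-4 example, computed with numpy:

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([4.0, 3.0])        # legs of length 4 and 3
d = np.abs(a - b)               # per-dimension absolute differences

euclidean = float(np.sqrt((d ** 2).sum()))   # 5.0
squared = float((d ** 2).sum())              # 25.0
manhattan = float(d.sum())                   # 7.0
maximum = float(d.max())                     # 4.0
print(euclidean, squared, manhattan, maximum)
```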
- 72. Clustering for Learning Analytics • Grouping of: • Students • Demographics • Behavior • Preferences • Courses taken • Academic performance • Resources • Reading materials • Discussions • Courses • Course design improvement 72
- 73. Some examples Kovanović, V., Joksimović, S., Gašević, D., Owers, J., Scott, A.-M., & Woodgate, A. (2016). Profiling MOOC course returners: How does student behaviour change between two course enrolments? In Proceedings of the Third ACM Conference on Learning @ Scale (L@S’16) (pp. 269–272). New York, NY, USA: ACM. https://doi.org/10.1145/2876034.2893431 73
- 74. Dataset • 28 offerings of 11 different Coursera MOOCs at the University of Edinburgh • 26,025 double course enrolment records • 52,050 course enrolment records • K-means clustering • Too large for clustering methods that use pairwise distances (e.g., hierarchical clustering)

  1. Artificial Intelligence Planning (offerings 1, 2, 3)
  2. Animal Behavior and Welfare (1, 2)
  3. AstroTech: The Science and Technology behind Astronomical Discovery (1, 2)
  4. Astrobiology and the Search for Extraterrestrial Life (1, 2)
  5. The Clinical Psychology of Children and Young People (1, 2)
  6. Critical Thinking in Global Challenges (1, 2, 3)
  7. E-learning and Digital Cultures (1, 2, 3)
  8. EDIVET: Do you have what it takes to be a veterinarian? (1, 2)
  9. Equine Nutrition (1, 2, 3)
  10. Introduction to Philosophy (1, 2, 3, 4)
  11. Warhol (1, 2)

  74
- 75. Extracted features • 9 different features extracted:
  • Days: no. of days active
  • Sub.: no. of submitted assignments
  • Wiki: no. of wiki page views
  • Disc.: no. of discussion views
  • Posts: no. of discussion messages written
  • Quiz.: no. of quizzes attempted
  • Quiz.Uni.: no. of different quizzes attempted
  • Vid.Uni.: no. of different videos watched
  • Vid.: no. of videos watched
  75
- 76. Results 76
- 77. Results
  • Enrol only (E): 22,932 students (44.1%)
  • Low engagement (LE): 21,776 students (41.8%)
  • Videos & Quizzes (VQ): 2,120 students (4.1%)
  • Videos (V): 5,128 students (9.9%)
  • Social (S): 94 students (0.2%)
  77
- 78. Some examples Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015). Analytics of communities of inquiry: Effects of learning technology use on cognitive presence in asynchronous online discussions. The Internet and Higher Education, 27, 74–89. https://doi.org/10.1016/j.iheduc.2015.06.002 78
- 79. Clustering features
  Clustering variables (content):
  1. ULC (UserLoginCount): total number of times the student logged into the system
  2. CVC (CourseViewCount): total number of times the student viewed general course information
  3. AVT (AssignmentViewTime): total time spent on all course assignments
  4. AVC (AssignmentViewCount): total number of times the student opened one of the course assignments
  5. RVT (ResourceViewTime): total time spent on reading the course resources
  6. RVC (ResourceViewCount): total number of times the student opened one of the course resource materials
  Clustering variables (discussion):
  7. FSC (ForumSearchCount): total number of times the student used the search function on the discussion boards
  8. DVT (DiscussionViewTime): total time spent on viewing the course's online discussions
  9. DVC (DiscussionViewCount): total number of times the student opened one of the course's online discussions
  10. APT (AddPostTime): total time spent on posting discussion board messages
  11. APC (AddPostCount): total number of discussion board messages posted by the student
  12. UPT (UpdatePostTime): total time spent on updating the student's own discussion board messages
  13. UPC (UpdatePostCount): total number of times the student updated one of their own discussion board messages
  79
- 80. Results 80
- 81. Cluster interpretations
  1. Task-focused users (n = 21): overall below-average activity, above-average message-posting activity
  2. Content-focused users (n = 15): below-average discussion-related activity, average content-related activity, emphasis on assignments
  3. No-users (n = 22): overall below-average activity, slightly higher in discussion-related activities
  4. Highly intensive users (n = 3): by far the most active students, especially in content-related activities
  5. Content-focused intensive users (n = 6): above-average content-related activity, average discussion-related activity
  6. Socially-focused intensive users (n = 14): above-average discussion-related activity, average content-related activity
  81
- 82. Some examples Almeda, M. V., Scupelli, P., Baker, R. S., Weber, M., & Fisher, A. (2014). Clustering of Design Decisions in Classroom Visual Displays. In Proceedings of the Fourth International Conference on Learning Analytics and Knowledge (pp. 44–48). New York, NY, USA: ACM. https://doi.org/10.1145/2567574.2567605 82
- 83. Clustering visual designs of classrooms • 30 schools in northwestern USA • Classroom Wall Coding System, CWaCS 1.0 • Each classroom wall was photographed • Units of analysis were marked with a box • Coding scheme: 1. Academic: academic topics (F1) and academic organizational (F2); 2. Non-academic (F3); 3. Behavioural (F4) • Adopted K-means to cluster classrooms based on the frequency of the four features (F1-F4)
  • Academic topics: behavior, content specific, procedures, resources, calendars/clocks, other
  • Academic organizational: goals for the day, group assignments, job charts, labels, schedule (day/week/yearly), skills, homework
  • Non-academic: motivational slogans, decorations, decorative frames, student art, other non-academic
  • Behavior materials: behavior management, progress charts, rules, other behaviour
  83
- 84. Clusters 84
- 85. Some examples Ferguson, R., Clow, D., Beale, R., Cooper, A. J., Morris, N., Bayne, S., & Woodgate, A. (2015). Moving Through MOOCS: Pedagogy, Learning Design and Patterns of Engagement. In Design for Teaching and Learning in a Networked World (pp. 70–84). Springer International Publishing. https://doi.org/10.1007/978-3-319-24258-3_6 85
- 86. Features Possible combinations: • 1 = Visited content only • 2 = Posted comment but visited no new content • 3 = Visited content and posted comment • 4 = Submitted the assessment late • 5 = Visited content and submitted assessment late • 6 = Posted late assessment, saw no new content • 7 = Visited content, posted, late assessment • 8 = Submitted assessment early /on time • 9 = Visited content, assessment early /on time • 10 = Posted, assessment early /on time, no new content • 11 = Visited, posted, assessment early /on time For each course week, we assigned learners an activity score: • 1 if they viewed content • 2 if they posted a comment • 4 if they submitted their assessment in a subsequent week • 8 if they submitted it early or on time • Adopted K-means 86
- 87. 1. Samplers 2. Strong Starters 3. Returners 4. Midway Dropouts 5. Nearly There 6. Late Completers 7. Keen Completers 87
- 88. Further examples
  Lust, G., Elen, J., & Clarebout, G. (2013). Regulation of tool-use within a blended course: Student differences and performance effects. Computers & Education, 60(1), 385–395. https://doi.org/10.1016/j.compedu.2012.09.001
  Wise, A. F., Speer, J., Marbouti, F., & Hsiao, Y.-T. (2013). Broadening the notion of participation in online discussions: examining patterns in learners' online listening behaviors. Instructional Science, 41(2), 323–343. https://doi.org/10.1007/s11251-012-9230-9
  Niemann, K., Schmitz, H.-C., Kirschenmann, U., Wolpers, M., Schmidt, A., & Krones, T. (2012). Clustering by Usage: Higher Order Co-occurrences of Learning Objects. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge (pp. 238–247). New York, NY, USA: ACM. https://doi.org/10.1145/2330601.2330659
  Cobo, G., García-Solórzano, D., Morán, J. A., Santamaría, E., Monzo, C., & Melenchón, J. (2012). Using Agglomerative Hierarchical Clustering to Model Learner Participation Profiles in Online Discussion Forums. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge (pp. 248–251). New York, NY, USA: ACM. https://doi.org/10.1145/2330601.2330660
  Crossley, S., Roscoe, R., & McNamara, D. S. (2014). What Is Successful Writing? An Investigation into the Multiple Ways Writers Can Write Successful Essays. Written Communication, 31(2), 184–214. https://doi.org/10.1177/0741088314526354
  Hecking, T., Ziebarth, S., & Hoppe, H. U. (2014). Analysis of Dynamic Resource Access Patterns in Online Courses. Journal of Learning Analytics, 1(3), 34–60.
  Li, N., Kidziński, Ł., Jermann, P., & Dillenbourg, P. (2015). MOOC Video Interaction Patterns: What Do They Tell Us? In Proceedings of the 10th European Conference on Technology Enhanced Learning (pp. 197–210). Springer International Publishing. https://doi.org/10.1007/978-3-319-24258-3_15
  88
- 89. K-Means clustering • The most widely used clustering algorithm • Very simple, decent results • Produces “circular” clusters • Iterative algorithm • Initial position of cluster centroids random • Often done multiple times and results averaged out (e.g., 1,000 times) 89
- 90. K-Means algorithm 1. Pick the number of clusters K 2. Pick K centroids $c_i$, $i \in 1 \ldots K$, in the N-dimensional feature space 3. For each of the P data points $p_i$: 1. Calculate the distance to each of the K centroids 2. Assign it to its closest centroid 4. Recalculate centroid positions based on the assigned data points 5. Repeat steps 3-4 until the centroid positions stabilize (i.e., there is no change in step 4) 90
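The steps above can be sketched as a minimal, non-robust implementation (numpy assumed; random initialization picks K data points as the starting centroids):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch following the steps on the slide."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # step 2
    for _ in range(n_iter):
        # Step 3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid from its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):                          # step 5: stable
            break
        centroids = new
    return labels, centroids

# Two obvious blobs around (0, 0) and (10, 10)
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
labels, centroids = kmeans(X, 2)
print(labels)
```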
- 92. K-Means characteristics • The final solution depends a lot on the original random centroid positions • The algorithm is often repeated (restarted) many times. • Restart K-means R (e.g., 1,000) times. • For each of the data points there will be R cluster assignments. • For each data point, pick the cluster assignment which was most common among R assignments 92
- 93. K-Means characteristics • The algorithm is easy to implement • Pretty fast, converges very quickly • For N data points, each iteration requires calculating N·K distances (which is $O(N)$ for a fixed K) • Produces circular clusters, which can be a problem in some domains • Susceptible to outliers: each data point is assigned to one of the centroids and can shift its centroid significantly "off side" • The number of clusters must be provided • Can get stuck in a local optimum (often mitigated by multiple runs) 93
- 94. K-Means variants • K-Means++ • "Smart" picking of the initial centroids (a.k.a. seeds) • Seed selection algorithm: • Pick the first seed randomly (uniform distribution across the whole space) • Pick each subsequent seed with a probability proportional to its squared distance from the closest already-chosen seed • Effectively "spreads" the seed centroids across the feature space • K-Medoids & its flavours (Partitioning Around Medoids, PAM) • A solution to the outlier problem: instead of using a centroid, use a medoid • Instead of representing clusters with centres, use existing data points to represent clusters 94
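The K-means++ seeding rule can be sketched as follows (illustrative only; production implementations such as scikit-learn's add further refinements):

```python
import numpy as np

def kmeanspp_seeds(X, k, seed=0):
    """Pick k seeds: the first uniformly at random, each subsequent one
    with probability proportional to its squared distance to the closest
    already-chosen seed."""
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]
    while len(seeds) < k:
        # Squared distance of every point to its closest existing seed
        d2 = np.min([((X - s) ** 2).sum(axis=1) for s in seeds], axis=0)
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], float)
print(kmeanspp_seeds(X, 2))
```

Because distant points get high selection probability, the seeds tend to end up in different regions of the feature space.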
- 95. PAM algorithm (Partitioning Around Medoids) • One variant of K-Medoids: 1. Pick the number of clusters K 2. Pick K data points $m_i$, $i \in 1 \ldots K$, in the N-dimensional feature space as the initial cluster representatives 3. Assign each of the remaining M−K data points $p_i$ to its closest representative 4. For each representative point $o_j$: 1. Pick a random non-representative data point $o_{random}$ from its cluster 2. Check whether swapping $o_j$ with $o_{random}$ produces clusters with a smaller "error" (the sum over all clusters of the absolute differences between their data points and representatives) 3. If the new cost is smaller than the original cost, keep $o_{random}$ as the representative point 5. Repeat steps 3-4 until there are no changes in the representative objects 95
- 96. K-Means variants • X-Means • Does not require number of clusters K to be specified • Refines clustering solution by splitting existing clusters • Keeps the clustering configuration which maximizes AIC (Akaike information criterion) or BIC (Bayesian information criterion) • Implemented in WEKA • Cascading K-Means • Restarts K-means with different K and picks the K that maximizes Calinski and Harabasz criterion (F value in ANOVA) • Implemented in WEKA 96
- 97. K-Means variants • Large-dataset variants: CLARA (Clustering LARge Applications) and CLARANS (Clustering Large Applications upon RANdomized Search) • CLARA: use a sample of data points as potential candidate medoids and run PAM • CLARANS: adds randomization so the sample is not fixed at the start • Fuzzy C-means • Each data point can belong to multiple clusters with different membership degrees (the degrees across all clusters sum to 100%) • Also assesses the compactness of each cluster • Compact clusters will have members with high membership degrees 97
- 98. Running K-means & X-Means in WEKA 98
- 99. Hierarchical clustering • Next to K-means, a very popular method for cluster analysis • Two key flavours: • agglomerative • divisive • Especially useful for small datasets • Evaluate and pick the number of clusters visually • The height of a merge/split indicates the distance • Used extensively in Learning Analytics • Many variants, using different linkage functions 99
- 100. Agglomerative hierarchical clustering • Build the clusters from bottom-up • Algorithm: • Build a singleton cluster for each data point • Repeat until all data in a single cluster: • find two closest clusters (based on linkage function) • merge these two together • Run Interactive DEMO 100
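A small agglomerative run with SciPy (assumed available): `linkage` performs the bottom-up merging, and `fcluster` cuts the resulting dendrogram at a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two obvious blobs around (0, 0) and (10, 10)
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)

# Z records every merge and its height (here with Ward's linkage)
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters
print(labels)
```

Plotting `Z` with `scipy.cluster.hierarchy.dendrogram` gives the visual overview mentioned on the previous slide.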
- 101. Agglomerative hierarchical clustering • Requires calculation of the distances between all cluster pairs • At step 1 this means calculating all pairwise distances among data points • N data points: N(N−1)/2 distances • Not feasible for large datasets 101
- 102. Divisive hierarchical clustering • All data start in a single cluster, then we split one cluster at each step • More complex than agglomerative (how do you split a cluster?) • Less popular than agglomerative algorithms • Can be faster, as we do not need to go all the way to the bottom of the dendrogram • Many approaches, often using a "flat" algorithm (e.g., K-means) as the partitioning method 102
- 103. Example divisive clustering with K-means • Start with all data in a single cluster • Use K-means to create two initial clusters A2 and B2 • Use K-means to divide A2 into A2-1 and A2-2 • Use K-means to divide B2 into B2-1 and B2-2 • Pick between: • A2-1, A2-2, B2 • A2, B2-1, B2-2 • Call the best combination A3, B3, and C3 • Repeat the division of each cluster into two clusters. Pick between: • A3-1, A3-2, B3, C3 • A3, B3-1, B3-2, C3 • A3, B3, C3-1, C3-2 103
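One divisive step of the kind described above, sketched with K-means as the splitter (a simplified bisecting step; scikit-learn assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisect(X, seed=0):
    """Split one cluster into two using K-means, as in the divisive scheme above."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X)
    return X[labels == 0], X[labels == 1]

# Three tight blobs; the first split separates one blob from the other two
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [20, 20], [20, 21]], float)
A, B = bisect(X)
print(len(A), len(B))
```

Repeating `bisect` on the larger of the two pieces (and keeping whichever split is best) gives the full recursive procedure described on the slide.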
- 104. Linkage functions • Key question for agglomerative clustering: how to pick the two clusters to merge • What is meant by "closest"? • Several different criteria. Most popular: • Single-linkage: minimal distance between any two data points (one from each cluster) • Complete-linkage: maximal distance between any two data points (one from each cluster) • Average-linkage: average distance between all pairs of data points (one from each cluster); the related centroid-linkage uses the distance between cluster centroids • Ward's method: pick the pair of clusters so that the merged cluster has the minimal possible within-cluster sum of squared distances. Minimizes the variation within the clusters. 104
- 106. Linkage functions • Different linkage functions can produce very different results [Figure panels: single linkage, complete linkage, Ward's method] 106
- 107. Hierarchical clustering in Weka FUN PART 107
- 108. Short intro to WEKA 108
- 109. What is Weka? Software “workbench” Waikato Environment for Knowledge Analysis (WEKA) 109
- 110. Installing Weka • https://www.cs.waikato.ac.nz/ml/weka/index.html • Very powerful, lots of resources available • Good for fast prototyping, much faster than R/Python • Can be used: • Through the GUI, which is very quirky and has hidden "gems" • From the command line (useful for integrating with other tools/scripts) • As a Java library • Not the best-designed UI, clearly done by the developers • Great book about ML/DM/Weka: https://www.cs.waikato.ac.nz/ml/weka/book.html • Many demo datasets included in Weka: https://www.cs.waikato.ac.nz/ml/weka/datasets.html 110
- 111. Weka Interfaces [Screenshot of the Weka GUI Chooser] • Explorer: will be used throughout the course • Experimenter: performance comparisons • KnowledgeFlow: graphical front end, an alternative to Explorer • Workbench: unified interface • Simple CLI: command-line interface 111
- 112. Weka Explorer 112
- 113. Weka Explorer 113
- 114. Attribute Relation File Format (ARFF)

  Data.CSV:
  5.1, 3.5, 1.4, 0.2, Iris-setosa
  4.9, 3.0, 1.4, 0.2, Iris-setosa
  4.7, 3.2, 1.3, 0.2, Iris-setosa
  4.6, 3.1, 1.5, 0.2, Iris-setosa
  5.0, 3.6, 1.4, 0.2, Iris-setosa

  Data.ARFF:
  @RELATION iris
  @ATTRIBUTE sepallength REAL
  @ATTRIBUTE sepalwidth REAL
  @ATTRIBUTE petallength REAL
  @ATTRIBUTE petalwidth REAL
  @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
  @DATA
  5.1,3.5,1.4,0.2,Iris-setosa
  4.9,3.0,1.4,0.2,Iris-setosa
  4.7,3.2,1.3,0.2,Iris-setosa
  4.6,3.1,1.5,0.2,Iris-setosa
  5.0,3.6,1.4,0.2,Iris-setosa

  114
- 115. WEKA skills • Loading/Saving data (CSV ARFF) • Filtering and pre-processing attributes • Visualising data • Clustering • Saving clustering results 115
- 116. Lets play with WEKA 116
- 117. Selecting the number of clusters • Clustering is user-centric and subjective • How to pick the number of clusters? • Based on background knowledge (e.g., educational theory) • Use an algorithm that calculates the optimal number of clusters automatically (e.g., EM) • Use an algorithm that provides a visual overview of all clustering configurations (e.g., hierarchical clustering) • Use a supervised clustering algorithm where the clustering process is guided by the user • Evaluate multiple values of K manually • "Elbow" method: find the trade-off point between the number of clusters and within-cluster variance • Silhouette method: test the robustness of cluster membership 117
- 118. Elbow method • As K increases, the average diameter (variance) of the clusters also decreases • Find the “sweet spot” at which the decrease in variance sharply changes • Sometimes not so clear 118
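The workshop uses Weka, but the elbow computation is easy to sketch in Python with scikit-learn (a stand-in here, not part of Weka): run K-means for several values of K and record the within-cluster sum of squares (`inertia_`), then look for the bend.

```python
# Elbow-method sketch: within-cluster variance (KMeans inertia) against K.
# Toy data: three well-separated blobs, so the "elbow" should appear at K=3.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_   # sum of squared distances to the nearest centroid

for k, v in inertias.items():
    print(k, round(v, 1))       # the drop flattens sharply after K=3
```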
- 119. Silhouette method • Visual method for determining the number of clusters • a(i) – the average distance of point i to all other points in its cluster • b(i) – the smallest average distance of point i to the points in another cluster (i.e., the distance to the closest neighbouring cluster) • s(i) = (b(i) − a(i)) / max(a(i), b(i)) • Equivalently: s(i) = (b(i) − a(i)) / b(i) if b(i) > a(i); 0 if b(i) = a(i); (b(i) − a(i)) / a(i) if b(i) < a(i) 119
- 120. Silhouette method • s(i) = 1 − a(i)/b(i) if b(i) > a(i); 0 if b(i) = a(i); b(i)/a(i) − 1 if b(i) < a(i) • Hence s(i) ≈ 1 if b(i) ≫ a(i); 0 if b(i) = a(i); s(i) ≈ −1 if b(i) ≪ a(i) • Interpretation: s(i) near 1 – good fit to the cluster; near 0 – borderline fit; near −1 – bad fit to the cluster 120
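A direct implementation of these definitions on a four-point toy example, cross-checked against scikit-learn's `silhouette_samples` (scikit-learn used here only as an independent reference, not as part of the Weka workflow):

```python
# Per-point silhouette s(i) computed straight from the a(i)/b(i) definitions,
# then compared with scikit-learn's implementation.
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])

def s(i):
    d = np.linalg.norm(X - X[i], axis=1)           # distances from point i
    own = labels == labels[i]
    a = d[own & (np.arange(len(X)) != i)].mean()   # a(i): mean intra-cluster distance
    b = min(d[labels == c].mean()                  # b(i): closest neighbouring cluster
            for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

mine = np.array([s(i) for i in range(len(X))])
ref = silhouette_samples(X, labels)
print(np.round(mine, 3))   # well-separated pairs, so all values are close to 1
```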
- 121. K=2 example 121
- 122. K=3 example 122
- 123. K=4 example 123
- 124. K=5 example 124
- 125. K=6 example 125
- 126. Silhouette coefficient • The average s(i) over all data points • Pick the number of clusters that maximizes the average silhouette coefficient 126
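This selection rule can be sketched with scikit-learn (again a stand-in for Weka): compute the average silhouette for several candidate K values and keep the maximizer.

```python
# Choosing K by maximizing the average silhouette coefficient.
# Toy data: three blobs, so K=3 should win.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2))
               for c in ((0, 0), (4, 0), (2, 4))])

scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=0).fit_predict(X))
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)   # K with the highest average silhouette
print(best_k, round(scores[best_k], 2))
```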
- 127. Silhouette coefficient and Elbow method in WEKA 127
- 128. Challenges and solutions • High dimensionality & feature (attribute) selection • Categorical attributes • “Weirdly-shaped” clusters • Outliers 128
- 129. Curse of dimensionality • Euclidean distance metric: d(a, b) = √(Σᵢ (aᵢ − bᵢ)²) • In a highly-dimensional space with d dimensions (each in 0.0–1.0): • It is highly likely that for at least one feature i, the value |aᵢ − bᵢ| will be close to 1.0 • This puts a lower bound on the distance near 1.0 • However, the upper bound is √d • Most pairs of data points are far from this upper bound • Most data points lie at a distance close to the average distance • There are many irrelevant dimensions for most clusters • The noise in irrelevant dimensions masks the real differences between clusters 129
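A small numpy/SciPy experiment illustrates this concentration effect: for random points in the unit hypercube, the relative spread of pairwise distances collapses as the number of dimensions grows, so "near" and "far" stop being informative.

```python
# Distance-concentration sketch: pairwise Euclidean distances between random
# points bunch up around the mean as dimensionality grows.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
spread = {}
for d in (2, 1000):
    X = rng.random((200, d))              # 200 random points in the unit hypercube
    dist = pdist(X)                       # all unique pairwise distances
    spread[d] = (dist.max() - dist.min()) / dist.mean()   # relative spread
    print(d, round(spread[d], 3))         # much smaller for d=1000 than for d=2
```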
- 130. Curse of dimensionality: some solutions • Feature transformation methods (essentially compression) • Create a smaller number of new, synthetic features from the larger number of input features, which are then used for clustering • Principal component analysis (PCA) • Singular-value decomposition (SVD) • Feature (attribute) selection methods • Search for a subset of features that are relevant for a given domain • Minimize entropy • The idea is that feature spaces containing tight clusters have low entropy • Subspace clustering – an extension of attribute selection 130
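A scikit-learn sketch of the feature-transformation route (the "20 correlated features" here are synthetic, made up for illustration): compress with PCA, then cluster in the reduced space.

```python
# Feature-transformation sketch: many correlated features are compressed to
# 2 principal components, and K-means runs in the reduced space.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two hidden groups in 2-D, mapped by a random linear map up to 20 noisy features
base = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                  for c in ((0, 0), (6, 6))])
X = base @ rng.normal(size=(2, 20)) + rng.normal(scale=0.1, size=(100, 20))

pca = PCA(n_components=2).fit(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pca.transform(X))
print(round(pca.explained_variance_ratio_.sum(), 3))  # 2 components keep most variance
```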
- 131. Curse of dimensionality: some solutions • Popular algorithms: • CLIQUE: a dimension-growth subspace clustering method • Starts with a single dimension and grows the space by adding new dimensions • Explores dense subspace regions • PROCLUS: a dimension-reduction subspace clustering method • Starts with the complete high-dimensional space and assigns a weight to each dimension for every cluster; the weights are then used to regenerate the clusters 131
- 132. Categorical data • Most clustering algorithms assume continuous numerical attributes (ratio variables) • How do we cluster categorical data? E.g., clustering students based on their demographic characteristics: • Gender • Program • Study level (postgraduate vs. undergraduate) • Domestic/international 132
- 133. Categorical data: simple solution • Ignore the problem and treat categorical data as numerical: • Male: 1, Female: 2 • Domestic: 1, International: 2 • Often does not produce good results • The distance metric is not meaningful: • Point A: (Male, Domestic) • Point B: (Female, Domestic) • Point C: (Female, International) • Is point B closer to point A or to point C? • Depends on the information value of these two features • “Localized method”: if two distinct clusters have a few points that are close, they might be merged together incorrectly [figure: points A, B, C plotted on Gender vs. Domestic/International axes] 133
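One standard alternative to arbitrary numeric codes (not shown on the slide, but common practice) is one-hot encoding: each category gets its own 0/1 column, so any two distinct categories are the same distance apart. A pure-Python sketch:

```python
# One-hot encoding sketch: instead of arbitrary codes (Male=1, Female=2), give
# each category its own 0/1 column so all distinct categories are equidistant.
def one_hot(rows):
    categories = [sorted(set(col)) for col in zip(*rows)]  # per-column category lists
    encoded = []
    for row in rows:
        vec = []
        for value, cats in zip(row, categories):
            vec.extend(1 if value == c else 0 for c in cats)
        encoded.append(vec)
    return encoded

students = [["Male", "Female"][0:1] + ["Domestic"],   # point A: (Male, Domestic)
            ["Female", "Domestic"],                   # point B
            ["Female", "International"]]              # point C
students[0] = ["Male", "Domestic"]
X = one_hot(students)
print(X)
```

Under this encoding, point B is exactly as far from A as from C (each pair differs in one attribute), which removes the Male=1/Female=2 artefact, though it still ignores the information value of the features.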
- 134. Categorical data: custom algorithms • ROCK (RObust Clustering using linKs) • A hierarchical clustering algorithm for categorical attributes • Two data points are similar if they have similar neighbours • Typical example: market basket data 134
- 135. “Weirdly-shaped” clusters • Most algorithms focus on the distance between data points • However, the connectedness of data points is often also important • Different algorithms have been developed for these situations 135
- 136. Different types of clustering methods • Distance-based clustering • Group objects based on distance among them • Density-based clustering • Group objects based on area they occupy 136
- 137. CURE • Pick a subsample of the data and cluster it using a method such as hierarchical clustering • Pick N characteristic points for each cluster that are most distant from each other • Move the representative points a fraction of the way towards the cluster centroid • Merge any two clusters whose representative points are sufficiently close 137
- 138. DBSCAN • DBSCAN (Density-based spatial clustering of applications with noise) • Density-based algorithm • Searches for areas with a large number of points • Implemented in WEKA • General idea: • Each data point is either a core point, a reachable point, or an outlier • Core points have at least minP (parameter) points around them within the radius r (parameter) • Reachable points are within radius r of a core point • Every other data point is an outlier [figure: example with minP=4 – red: core points, yellow: reachable points, blue: outliers] 138
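DBSCAN is also available outside Weka; a scikit-learn sketch (where the slide's r is called `eps` and minP is `min_samples`) on toy data with a deliberate outlier:

```python
# DBSCAN sketch: two dense blobs plus one far-away point, which DBSCAN
# labels as noise (cluster label -1).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(0, 0), scale=0.2, size=(30, 2)),
               rng.normal(loc=(5, 5), scale=0.2, size=(30, 2)),
               [[20.0, 20.0]]])                      # an obvious outlier

labels = DBSCAN(eps=0.8, min_samples=4).fit_predict(X)
print(sorted(set(labels)))   # one noise label (-1) and two clusters (0, 1)
```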
- 139. Self-organising maps (SOM) • Special type of neural network • Used to learn the contour of the underlying data • Neurons are laid out in a grid structure, with each neuron connected to its neighbours and to all input nodes • For each data point, the neuron closest to it is adjusted, with the adjustment propagated to the neighbouring neurons • Over time, the neurons position themselves in the shape of the data • Dense areas with many neurons indicate clusters 139
- 141. Expectation-maximization (EM) clustering • EM itself is much more general than clustering • Used to estimate hidden (latent) parameters • Does not require the number of clusters to be specified in advance (e.g., Weka's EM can select it via cross-validation) • General idea: • Pick a number of clusters K • Fit K distributions with parameters P over the clustering variables • Estimate the likelihood of each data point being generated by each of the K distributions (expectation) • For every data point, sum the likelihoods of being generated by any of the K distributions • Combine the weights with the data to produce new estimates for the parameters P (maximization) • Repeat until convergence is reached (no parameter change) 141
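The same idea is implemented outside Weka as Gaussian mixture models; a scikit-learn sketch on toy 1-D data, where `predict_proba` exposes the expectation-step responsibilities:

```python
# EM-clustering sketch using scikit-learn's GaussianMixture: fit K=2 Gaussians
# by expectation-maximization and read off the soft cluster memberships.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(100, 1)),
               rng.normal(loc=5.0, scale=0.5, size=(100, 1))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
resp = gm.predict_proba(X)          # expectation step: P(component | point)
means = np.sort(gm.means_.ravel())
print(np.round(means, 1))           # recovered component means, near 0 and 5
```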
- 144. Analysis of cluster differences 144
- 145. Analysis of cluster differences • We can check the differences between clusters with regard to • The clustering variables (e.g., number of logins, number of discussion posts) • Some additional variables (e.g., student grades, age, gender) • We can examine the differences • One variable at a time (univariate differences) • Across multiple variables simultaneously (multivariate differences) • Takes into consideration the interactions among multiple variables 145
- 146. Univariate analysis of cluster differences • For every variable we can use parametric and non-parametric univariate tests: • Two clusters: t-test and Mann-Whitney • Three or more clusters: one-way ANOVA and Kruskal-Wallis • Requires p-value adjustment (e.g., Bonferroni, Holm-Bonferroni correction) • Whether to use a parametric or non-parametric test primarily depends on the homogeneity (equality) of variance assumption • Can be tested with Levene's test • If Levene's test shows p < .05, use Mann-Whitney and Kruskal-Wallis • Significant ANOVA tests can be followed by pairwise tests (e.g., TukeyHSD) • Significant Kruskal-Wallis tests can be followed by pairwise KW tests (also with p-value correction) 146
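A SciPy sketch of this univariate pipeline on made-up cluster data: Levene's test to check the variance assumption, then the Kruskal-Wallis omnibus test across three clusters:

```python
# Univariate follow-up sketch: Levene's test for homogeneity of variance,
# then Kruskal-Wallis across three hypothetical clusters of a variable
# (e.g., logins per week).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
c1 = rng.normal(10, 2, 40)   # cluster 1
c2 = rng.normal(12, 2, 40)   # cluster 2
c3 = rng.normal(18, 2, 40)   # cluster 3, clearly different

lev_stat, lev_p = stats.levene(c1, c2, c3)   # p >= .05 suggests equal variances
kw_stat, kw_p = stats.kruskal(c1, c2, c3)    # omnibus difference across clusters
print(round(lev_p, 3), kw_p)
```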
- 147. Multivariate analysis of cluster differences • We can test differences across all variables at the same time • More holistic than ANOVA/KW • Instead of one dependent variable, we can have multiple variables • Use meaningful groups of variables (e.g., behavioural variables) • MANOVA: multivariate analysis of variance • The step “before” ANOVAs/KWs • Has several statistics: Wilks' Λ, Pillai's trace • Assumption: homogeneity of covariance • Much trickier to test: Box's M is one method, but it is very sensitive (use p < .001) • Use Levene's tests on each of the variables (doesn't guarantee homogeneity of covariance but might help) • If the assumption is violated, MANOVA can still be used, but with a more robust statistic (Pillai's trace) 147
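A statsmodels sketch of such a MANOVA on hypothetical toy data (cluster assignment as the single independent variable, two dependent measures); `mv_test()` reports both Wilks' Λ and the more robust Pillai's trace:

```python
# MANOVA sketch with statsmodels on made-up cluster data: two dependent
# measures (logins, posts), cluster membership as the independent variable.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cluster": np.repeat(["c1", "c2", "c3"], 30),
    "logins": np.concatenate([rng.normal(m, 1, 30) for m in (5, 8, 12)]),
    "posts":  np.concatenate([rng.normal(m, 1, 30) for m in (2, 2, 6)]),
})

fit = MANOVA.from_formula("logins + posts ~ cluster", data=df)
print(fit.mv_test())   # table includes Wilks' lambda and Pillai's trace
```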
- 148. Example MANOVA “For assessing the difference between student clusters a multivariate analysis of variance (MANOVA) was used. To validate the difference between the discovered clusters a MANOVA model with cluster assignment as a single independent variable and thirteen clustering variables as the dependent measures was constructed…” Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015). Analytics of communities of inquiry: Effects of learning technology use on cognitive presence in asynchronous online discussions. The Internet and Higher Education, 27, 74–89. doi:10.1016/j.iheduc.2015.06.002 148
- 149. Example MANOVA “Before running MANOVAs, … the homogeneity of covariances assumption was checked using Box's M test and the homogeneity of variances using Levene's test. To protect from the assumption violations, we log-transformed the data and used the Pillai's trace statistic which is considered to be robust against assumption violations. As a final protection measure, obtained MANOVA results were compared with the results of the robust rank-based variation of the MANOVA analysis” Kovanović et al. (2015), ibid. 149
- 150. Example MANOVA “The assumption of homogeneity of covariances was tested using Box's M test which was not accepted. Thus, Pillai's trace statistic was used, as it is more robust to the assumption violations, together with the Bonferroni correction method. A statistically significant MANOVA effect was obtained, Pillai's Trace = 1.62, F(39, 174) = 5.28, p < 10⁻¹⁴” Kovanović et al. (2015), ibid. 150
- 151. MANOVA follow-up analyses • A significant MANOVA can be followed up with two types of analyses: • Individual ANOVAs/KWs (with p-value correction) • Which in turn can be followed with pairwise analyses: TukeyHSD / pairwise KWs • Discriminant function analysis (DFA) • Shows which combinations of variables differentiate between the clusters • DFA can be run alone (without MANOVA), but its significance then can't be tested 151
- 152. 152 Kovanović et al. (2015), ibid.
- 153. Example DFA analysis 153 Kovanović et al. (2015), ibid.
- 154. Data from OU UK 154