Slides from my 2.5-day workshop at the 2018 Learning Analytics Summer Institute (LASI'18), held at Teachers College, Columbia University, June 11-13, 2018.
Machine Learning, AI, Deep Learning, Statistics, Data Mining… all of these are today's buzzwords, but what lies behind them?
Through concrete examples, we will walk through the different approaches to Machine Learning and the main families of algorithms (don't worry: without going into the core of their implementations), then the tools and frameworks available to Data Scientists… and finally, we will try to predict the future!
Salon Data - Nantes - 19 September 2017
https://salondata.fr/2017/07/12/0930-1030-ml/
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks... (Rodney Joyce)
Number 2 in the Data Science for Dummies series - we'll predict Titanic survival with Databricks, Python and Spark ML.
These are the slides only (excuse the PowerPoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/
If you have not used Databricks before, check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
London TensorFlow Meetup - 18th July 2017 (Daniel Ecer)
Slides from London TensorFlow Meetup on 18th July 2017
Corresponding repositories:
https://github.com/elifesciences/sciencebeam
https://github.com/elifesciences/sciencebeam-gym
Advances in Exploratory Data Analysis, Visualisation and Quality for Data Cen... (Hima Patel)
It is widely accepted that data preparation is one of the most time-consuming steps of the machine learning (ML) lifecycle. It is also one of the most important, as the quality of the data directly influences the quality of a model. In this session, we will discuss the importance and role of exploratory data analysis (EDA) and data visualisation techniques for finding data quality issues and preparing data when building ML pipelines. We will also discuss the latest advances in these fields and highlight areas that need innovation. Finally, we will discuss the challenges posed by industry workloads and the gaps that must be addressed to make data-centric AI real in industry settings.
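The kind of first-pass EDA checks the session describes can be as simple as a few lines of pandas; a minimal sketch on a hypothetical toy table (the column names and values below are invented for illustration):

```python
import pandas as pd

# Hypothetical toy training table standing in for a real ML dataset.
df = pd.DataFrame({
    "age": [34, None, 29, 41],
    "fare": [7.25, 71.28, 8.05, None],
    "survived": [0, 1, 1, 0],
})

# First-pass data quality checks: missing values and basic distributions.
missing_per_column = df.isna().sum()
summary = df.describe()

print(missing_per_column.to_dict())
```

Counting missing values per column and scanning the `describe()` summary for impossible ranges are usually the cheapest quality checks to run before any modelling step.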
Combining machine learning and search through learning to rank (Jettro Coenradie)
In this presentation, we will go through all the steps to use machine learning to improve your search results. We'll discuss the search basics you need to know as well as some machine learning basics. After that, we use a sample application available at the URL https://rolling500.luminis.amsterdam to show improvements using a trained model and the learning to rank plugin in Elasticsearch.
Applied Machine Learning for Ranking Products in an Ecommerce Setting (Databricks)
As a leading e-commerce company in fashion in the Netherlands, Wehkamp dedicates itself to providing a better shopping experience for its customers. Using Spark, the data science team is able to develop various machine-learning projects for this purpose based on large-scale data about products and customers. A major topic for the data science team is ranking products. If a visitor enters a search phrase, what are the best products that fit the search phrase, and in what order should the products be shown? Ranking products is also important if a visitor enters a product overview page, where hundreds or even thousands of products of a certain article type are displayed.
In this project, Spark is used in the whole pipeline: retrieving and processing the search phrases and their results, building click models, creating feature sets, training and evaluating ranking models, pushing the models to production using Elasticsearch, and creating Tableau dashboards. In this talk, we are going to demonstrate how we use Spark to build the whole pipeline for ranking products, and the challenges we faced along the way.
Combining machine learning and search through learning to rank (Jettro Coenradie)
With advanced search tools like Solr and Elasticsearch available, companies are embedding search in almost all their products and websites. Search is becoming mainstream, so we can focus on teaching the search engine tricks to return more relevant results. One new trick is called "learning to rank". During the presentation, you'll learn what learning to rank is and when to apply it, and of course you'll get an example showing how it works, using Elasticsearch and a learning to rank plugin. After this presentation, you will have learned to combine machine learning models and search.
Presented at the OECD Workshop on Systematic Reviews in the Scope of the Endocrine Disrupter Testing and Assessment (EDTA) Conceptual Framework Level 1 in Paris, France.
Machine learning for sensor data analytics (MATLABISRAEL)
In this presentation we will see how machine learning is done in the MATLAB environment. We will present several built-in capabilities and apps that make the machine learning process faster and more efficient, such as the Classification Learner, the Regression Learner, and Bayesian Optimization. Based on data collected from smartphone sensors, we will build a classification system that recognises the activity the user is performing: walking, climbing stairs, lying down, and so on.
Recommender systems support the decision-making processes of customers with personalized suggestions. These widely used systems influence the daily life of almost everyone, across domains like e-commerce, social media, and entertainment. However, the efficient generation of relevant recommendations in large-scale systems is a very complex task. In order to provide personalization, engines and algorithms need to capture users’ varying tastes and find mostly nonlinear dependencies between them and a multitude of items. Enormous data sparsity and ambitious real-time requirements further complicate this challenge. At the same time, deep learning has been proven to solve complex tasks like object or speech recognition where traditional machine learning failed or showed mediocre performance.
Explore a use case for vehicle recommendations at mobile.de, Germany’s biggest online vehicle market. Marcel shares a novel regularization technique for the optimization criterion and evaluates it against various baselines. To achieve high scalability, he combines this method with strategies for efficient candidate generation based on user and item embeddings—providing a holistic solution for candidate generation and ranking.
The proposed approach outperforms collaborative filtering and hybrid collaborative-content-based filtering by 73% and 143% for MAP@5. It also scales well to millions of items and users, returning recommendations in tens of milliseconds.
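MAP@5, the metric cited above, averages per-user precision over the top five ranked recommendations. A minimal sketch of how it is computed (the recommendation lists and relevance sets below are invented for illustration):

```python
def average_precision_at_k(recommended, relevant, k=5):
    """AP@k for one user: mean of precision@i over the ranks i that hit a relevant item."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i  # precision at this rank
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recommended, all_relevant, k=5):
    """MAP@k: AP@k averaged across all users."""
    aps = [average_precision_at_k(recs, rel, k)
           for recs, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

# Hypothetical example: two users, each with a top-5 recommendation list.
recs = [["a", "b", "c", "d", "e"], ["x", "y", "z", "w", "v"]]
rels = [{"a", "c"}, {"v"}]
print(map_at_k(recs, rels))
```

Because each hit is weighted by the precision at its rank, placing relevant items earlier in the list raises the score, which is why MAP@k is a common offline metric for ranking-quality comparisons like the one above.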
Event: O'Reilly Artificial Intelligence Conference, New York, 18.04.2019
Speaker: Marcel Kurovski, inovex GmbH
More tech talks: inovex.de/vortraege
More tech articles: inovex.de/blog
The Power of Auto ML and How Does it Work (Ivo Andreev)
Automated ML is an approach to minimizing the need for data science effort by enabling domain experts to build ML models without deep knowledge of algorithms or mathematics, or programming skills. The mechanism works by allowing end users to simply provide data; the system automatically does the rest by determining the approach for performing the particular ML task. At first this may sound discouraging to those aiming at the “sexiest job of the 21st century” - data science. However, Auto ML should be considered a democratization of ML rather than automatic data science.
In this session we will talk about how Auto ML works, how it is implemented by Microsoft, and how it could improve the productivity of even professional data scientists.
Introduction to Learning Analytics for High School Teachers and Managers (Vitomir Kovanovic)
Presentation at the first Learning Analytics Learning Network (LALN) Event in Adelaide, Australia on Oct 22, 2019.
Abstract:
With the increased adoption of technology, institutions have unprecedented opportunities to continuously improve the quality of their services through data collection and analysis. Schools and universities now have data about learners and their contexts that can provide valuable insight into how they learn. Early attempts were directed towards mining educational data to identify students at risk and develop interventions. More recently, researchers and practitioners have been deploying more sophisticated approaches. These include analyses of the learner behaviour that leads to various learning outcomes, of social networks and teams, and of employability, creativity, and critical thinking. Analysing the digital traces generated through learning processes requires a broad suite of methods from data science, statistics, psychometrics, and the social and learning sciences.
This workshop aims to introduce teachers and educators to the fast-growing and promising field of learning analytics and to explore how digital data can be used for the analysis and improvement of student learning. First, we will provide an overview of learning analytics, its key methods and approaches, and the problems for which it can be used. Second, attendees will engage in group learning activities to explore ways in which learning analytics could be used within their institutions. The focus will be on identifying learning-related challenges relevant to their particular context and exploring how learning analytics can be used practically and effectively.
Extending video interactions to support self-regulated learning in an online ... (Vitomir Kovanovic)
Slides from our presentation at the ASCILITE'18 conference in Geelong, Victoria. The full paper is available in the ASCILITE conference proceedings at http://ascilite.org/wp-content/uploads/2018/12/ASCILITE-2018-Proceedings.pdf
More Related Content
Similar to Unsupervised Learning for Learning Analytics Researchers
Analysing social presence in online discussions through network and text anal... (Vitomir Kovanovic)
The slides from our presentation at the IEEE ICALT'19 conference.
Abstract:
This paper presents an approach to studying relationships between students' social presence and course topics from transcripts of asynchronous discussions in online learning environments. Specifically, the paper uses topic modelling and epistemic network analysis to investigate how students' social presence is expressed across different course topics. Finally, we show how this method can be adopted to examine how students' social presence changed due to an instructional intervention. The results of this study and its implications are further discussed.
Automated Analysis of Cognitive Presence in Online Discussions Written in Por... (Vitomir Kovanovic)
Slides from our EC-TEL'18 paper presentation. The full paper is available at https://dx.doi.org/10.1007/978-3-319-98572-5_19
Abstract:
This paper presents a method for the automated content analysis of students’ messages in asynchronous discussions written in Portuguese. In particular, the paper looks at the problem of coding discussion transcripts for the levels of cognitive presence, a key construct in the widely used Community of Inquiry model of online learning. Although there are techniques for coding for cognitive presence in English, the literature is still poor in methods for other languages, such as Portuguese. The proposed method uses a set of 87 different features to create a random forest classifier that automatically extracts the cognitive phases. The developed model reached a Cohen’s κ of .72, which represents “substantial” agreement and is above the κ threshold of .70 commonly used in the literature for determining a reliable quantitative content analysis. This paper also provides some theoretical insights into the nature of cognitive presence by looking at the classification features that were most relevant for distinguishing between its different phases.
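Cohen's κ, the agreement measure the paper reports, corrects the raw agreement between two codings for the agreement expected by chance. A minimal sketch of the computation (the codings below are invented for illustration, not the paper's data):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: chance-corrected agreement between two codings of the same items."""
    n = len(coder_a)
    # Observed agreement: fraction of items both coders labelled identically.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement: chance agreement from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codings of ten messages into cognitive-presence phases 0-3
# (e.g. a human coder vs. an automated classifier).
human      = [0, 1, 1, 2, 2, 3, 3, 1, 2, 0]
classifier = [0, 1, 1, 2, 3, 3, 3, 1, 2, 1]
print(round(cohens_kappa(human, classifier), 2))
```

A κ of 1 means perfect agreement and 0 means no better than chance, which is why content-analysis studies like this one quote κ rather than raw percent agreement.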
Validating a theorized model of engagement in learning analytics (Vitomir Kovanovic)
Slides from our paper presentation at LAK'19 conference in Tempe, AZ. The full paper is available at https://dx.doi.org/10.1145/3303772.3303775
Abstract:
Student engagement is often considered an overarching construct in educational research and practice. Though frequently employed in the learning analytics literature, engagement has been subjected to a variety of interpretations, and there is little consensus regarding the very definition of the construct. This raises grave concerns with regard to construct validity: namely, do these varied metrics measure the same thing? To address such concerns, this paper proposes, quantifies, and validates a model of engagement which is both grounded in the theoretical literature and described by common metrics drawn from the field of learning analytics. To identify a latent variable structure in our data, we used exploratory factor analysis and validated the derived model on a separate sub-sample of our data using confirmatory factor analysis. To analyze the associations between our latent variables and student outcomes, a structural equation model was fitted, and the validity of this model across different course settings was assessed using MIMIC modeling. The broad consistency of our model with the theoretical literature across different domains suggests a mechanism that may be used to inform both interventions and course design.
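The exploratory-factor-analysis step described above can be sketched with scikit-learn's `FactorAnalysis`. This is an illustrative stand-in rather than the authors' analysis: the six "engagement indicators" and two latent factors below are simulated for the sake of the example.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulated data: six observed engagement indicators driven by two
# latent factors (e.g. behavioural and cognitive engagement) plus noise.
latent = rng.normal(size=(300, 2))
loadings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.0],
                     [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]])
observed = latent @ loadings.T + 0.3 * rng.normal(size=(300, 6))

# Exploratory factor analysis: recover a two-factor latent structure
# from the observed indicators alone.
fa = FactorAnalysis(n_components=2, random_state=0).fit(observed)
print(fa.components_.shape)  # loadings matrix: (factors, indicators)
```

Inspecting which indicators load heavily on which factor is the EFA step; confirming that structure on held-out data (CFA) and relating the factors to outcomes (SEM) would typically be done in dedicated SEM software.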
Examining the Value of Learning Analytics for Supporting Work-integrated Lear... (Vitomir Kovanovic)
Slides from our presentation at the Seventh National Conference on Work-Integrated Learning (ACEN’18).
The full paper is available at https://www.researchgate.net/publication/328578409_Examining_the_value_of_learning_analytics_for_supporting_work-integrated_learning
Introduction to R for Learning Analytics Researchers (Vitomir Kovanovic)
The slides from my 2-hour tutorial organised at the 2018 Learning Analytics Summer Institute (LASI) at Teachers College, Columbia University on June 11, 2018.
Kovanović et al. 2017 - Developing a MOOC experimentation platform: insight... (Vitomir Kovanovic)
LAK'17 Conference paper presentation:
Abstract:
In 2011, the phenomenon of MOOCs swept the world of education and put online education at the centre of public discourse around the world. Although researchers were excited by the vast amounts of MOOC data being collected, the benefits of this data did not live up to expectations, due to several challenges. Analyses of MOOC data are very time-consuming and labor-intensive, and require a highly advanced set of technical skills often not available to education researchers. Because of this, MOOC data analyses are rarely completed before the courses end, limiting the potential of the data to impact student learning outcomes and experience.
In this paper we introduce MOOCito (MOOC intervention tool), a user-friendly software platform for the analysis of MOOC data that focuses on conducting data-informed instructional interventions and course experimentation. We cover the important design principles behind MOOCito and provide an overview of the trends in MOOC research that led to its development. Although MOOCito is a work in progress, we outline its prototype and the results of a user evaluation study that focused on the system’s perceived usability and ease of use. The results of the study are discussed, as well as their practical implications.
Towards Automated Classification of Discussion Transcripts: A Cognitive Prese... (Vitomir Kovanovic)
LAK'16 Conference paper presentation:
Abstract:
In this paper, we present the results of an exploratory study that examined the problem of automating the content analysis of student online discussion transcripts. We looked at the problem of coding discussion transcripts for the levels of cognitive presence, one of the three main constructs in the Community of Inquiry (CoI) model of distance education. Using Coh-Metrix and LIWC features, together with a set of custom features developed to capture discussion context, we developed a random forest classification system that achieved 70.3% classification accuracy and a Cohen’s kappa of 0.63, which is significantly higher than the values reported in previous studies. Besides the improvement in classification accuracy, the developed system is also less sensitive to overfitting, as it uses only 205 classification features, around 100 times fewer than in similar systems based on bag-of-words features. We also provide an overview of the classification features most indicative of the different phases of cognitive presence, which gives additional insight into the nature of the cognitive presence learning cycle. Overall, our results show the great potential of the proposed approach, with the added benefit of providing further characterization of the cognitive presence coding scheme.
Executive Directors Chat: Leveraging AI for Diversity, Equity, and Inclusion (TechSoup)
Let’s explore the intersection of technology and equity in the final session of our DEI series. Discover how AI tools, like ChatGPT, can be used to support and enhance your nonprofit's DEI initiatives. Participants will gain insights into practical AI applications and get tips for leveraging technology to advance their DEI goals.
These slides are intended for master's students (MIBS & MIFB) at UUM, and are also useful for readers interested in the topic of contemporary Islamic banking.
How to Add Chatter in the Odoo 17 ERP Module (Celine George)
In Odoo, the chatter is like a chat tool that helps you collaborate on records. You can leave notes and track activities, making it easier to communicate with your team and partners. Inside the chatter, all communication history, activities, and changes are displayed.
Biological screening of herbal drugs: introduction and the need for phyto-pharmacological screening; new strategies for evaluating natural products; in vitro evaluation techniques for antioxidant, antimicrobial and anticancer drugs; in vivo evaluation techniques for anti-inflammatory, antiulcer, anticancer, wound-healing, antidiabetic, hepatoprotective, cardioprotective, diuretic and antifertility activity; toxicity studies as per OECD guidelines.
Unit 8 - Information and Communication Technology (Paper I).pdf (Thiyagu K)
These slides describe the basic concepts of ICT, the basics of email, emerging technologies, and digital initiatives in education. This presentation aligns with the UGC Paper I syllabus.
Safalta Digital Marketing Institute in Noida provides comprehensive courses that cover a wide range of digital marketing components, including search engine optimisation, digital communication marketing, pay-per-click marketing, content marketing, web analytics, and more. These courses are designed for students who want a thorough understanding of digital marketing strategies. The institute is a first choice for young people and students looking to start a career in digital marketing. It offers specialised courses and certification designed for beginners, providing in-depth training in areas such as SEO, digital communication marketing, and PPC in Noida. After finishing the program, students receive certifications recognised by top universities, setting a strong foundation for a successful career in digital marketing.
MATATAG CURRICULUM: ASSESSING THE READINESS OF ELEM. PUBLIC SCHOOL TEACHERS I...NelTorrente
In this research, it concludes that while the readiness of teachers in Caloocan City to implement the MATATAG Curriculum is generally positive, targeted efforts in professional development, resource distribution, support networks, and comprehensive preparation can address the existing gaps and ensure successful curriculum implementation.
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
Unsupervised Learning for Learning Analytics Researchers
1. UNSUPERVISED MACHINE
LEARNING
VITOMIR KOVANOVIĆ
UNIVERSITY OF SOUTH AUSTRALIA
#vkovanovic
vitomir.kovanovic.info
Vitomir.Kovanovic@unisa.edu.au
LEARNING ANALYTICS SUMMER INSTITUTE
TEACHERS COLLEGE, COLUMBIA UNIVERSITY
JUNE 11-13, 2018
SREĆKO JOKSIMOVIĆ
UNIVERSITY OF SOUTH AUSTRALIA
#s_joksimovic
www.sjoksimovic.info
Srecko.Joksimovic@unisa.edu.au
1
2. About me
• Learning analytics researcher
• Research Fellow, School of Education, UniSA
Data Scientist, Teaching Innovation Unit, UniSA
Member of the Centre for Change and Complexity in Learning (C3L)
• Member of the SoLAR executive board
• Computer science and information systems background
• Used cluster analysis in several research projects
2
3. About you
• Introduce yourself
• Name, affiliation, position
• Experience with machine learning and clustering
• Experience with Weka or some other ML/DM toolkit
• Ideas for clustering in your own research/work
3
5. Workshop outline
1. Three days, four sessions
2. Equally theoretical and practical
3. Use of Weka Machine Learning toolkit
4. Focus on practical use
5. Examples of clustering use in learning analytics
5
6. Workshop topics
• Introduction to machine learning & unsupervised methods
• Introduction to cluster analysis
• Overview of cluster analysis use in Learning Analytics
• Introduction to WEKA toolkit
• Overview of the tutorial dataset
• K-means algorithm
• K-means demo
• Hierarchical clustering algorithms
• Hierarchical clustering demo
6
7. Tutorial topics
• How to choose the number of clusters
• How to interpret clustering results
• Practical challenges
• More advanced cluster analysis approaches
• Statistical methods for comparing clusters
• Clustering real-world data from OU UK
• Discussing different cluster analysis methods
7
10. What is machine learning?
A computational method for making sense of data
10
11. Data is everywhere
Each minute:
● 3,600,000 Google searches
● 456,000 Twitter posts
● 46,740 Instagram photos
● 45,787 Uber trips
● 600 new Wikipedia edits
● 13 new Spotify songs
Domo (2017). “Data Never Sleeps 5.0”
https://www.domo.com
11
16. Fields that influenced machine learning
• Statistics
• Operations research
• Artificial intelligence
• Data visualisation
• Software engineering
• Information systems management
16
18. Two key ideas in machine learning
1.Features
2.Models
18
19. What is a feature?
1. A feature is a characteristic of a data point
2. Each data point is represented as a vector of features [f1, f2, f3 ... fm]
3. A whole dataset of N data points is represented as an N × M matrix:

   Data point   | Feature 1 | Feature 2 | ... | Feature M
   Data point 1 |           |           |     |
   Data point 2 |           |           |     |
   ...          |           |           |     |
   Data point N |           |           |     |
19
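The N × M representation can be written down directly. A minimal Python sketch (the values are borrowed from the vehicle example used later in the workshop; the variable names are mine):

```python
# A dataset of N data points, each described by M features,
# stored as an N x M matrix (here a plain list of rows).
feature_matrix = [
    # wheels, top speed (km/h), weight (kg)
    [4, 220, 1200],   # data point 1
    [2, 260, 230],    # data point 2
    [2, 210, 320],    # data point 3
    [4, 160, 870],    # data point 4
]

N = len(feature_matrix)      # number of data points
M = len(feature_matrix[0])   # number of features per data point
print(N, M)                  # -> 4 3
```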
20. What is a feature?
• The performance of machine learning algorithms depends in large part on the quality of the extracted features (how useful they are for a given ML task)
• Expertise and prior knowledge come into play when deciding which features to extract
20
21. What is a model?
• Something that captures important patterns in the data
• A model can be used to
• Draw inferences
• Understand the data
• Learn hidden rules
• Support decision making
21
22. An example model: BMI calculator
• Goal: Predicting a person’s body fat category (overweight, normal, or underweight)
from his height (in m) and weight (in kg).
• Model:
• BMI = weight / height²
• If BMI > 25: overweight
• If BMI < 18.5: underweight
• Otherwise: normal
• An example: 1.75 m and 70 kg:
  BMI = 70 / (1.75 × 1.75) = 22.86 → normal category
22
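The hand-crafted BMI model above fits in a few lines of Python (an illustrative sketch; the function name is mine):

```python
def bmi_category(height_m, weight_kg):
    """A hand-crafted 'model': maps height and weight to a body-fat category."""
    bmi = weight_kg / height_m ** 2
    if bmi > 25:
        return "overweight"
    if bmi < 18.5:
        return "underweight"
    return "normal"

# The example from the slide: 1.75 m, 70 kg -> BMI ~ 22.86 -> normal
print(bmi_category(1.75, 70))  # -> normal
```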
23. How does machine learning work?
• Model development (slow and hard):
  N data points → feature extraction → N × M feature matrix → model building → ML model
• Model use (fast and easy):
  a new data point → feature extraction → feature vector of length M → ML model → response (prediction)
23
24. Two types of errors
• Bias: The error from erroneous
assumptions of the model.
• High bias: miss the relevant
relationships between variables
(underfitting).
• Variance: The error from sensitivity to
small fluctuations in the data.
• High variance: modelling the
random noise in the data, rather
than real relationships (overfitting).
24
25. Two types of errors
• We always work with samples
• Samples always have noise
• The trick is to develop models that fit not just the training data, but new, future data
25
29. Many more approaches
• Models that blur the division between supervised and unsupervised
• Reinforcement learning: learning from the feedback received after making a prediction
• Neural networks (can be supervised and unsupervised)
• Online learning models: learning as data arrives
• Feature processing methods: association rule mining
29
31. 10 data points
Data point 1
Data point 2
Data point 3
Data point 4
Data point 5
Data point 6
Data point 7
Data point 8
Data point 9
Data point 10
31
32. How does machine learning work?
• Model development (slow and hard):
  N data points → feature extraction → N × M feature matrix → model building → ML model
• Model use (fast and easy):
  a new data point → feature extraction → feature vector of length M → ML model → response (prediction)
32
33. First step: feature extraction
• From each data point we extracted four features:
  • Number of wheels
  • Colour
  • Top speed (in km/h)
  • Weight (in kg)
• Our feature matrix is 10 × 4

  ID | Wheels | Colour | Top speed (km/h) | Weight (kg)
   1 |   4    | Yellow |       220        |    1,200
   2 |   4    | Red    |       180        |      950
   3 |   2    | Blue   |       260        |      230
   4 |   2    | Red    |       210        |      320
   5 |   4    | Yellow |       160        |      870
   6 |   4    | Blue   |       170        |      750
   7 |   4    | Red    |       190        |      850
   8 |   2    | Yellow |       140        |      140
   9 |   2    | Yellow |       210        |      310
  10 |   2    | Red    |       240        |      280
33
34. Supervised learning: classification
• Each data point is provided with a categorical class label (outcome variable)
• The goal is to predict the class label for a new data point

  ID | Wheels | Colour | Top speed (km/h) | Weight (kg) | Label
   1 |   4    | Yellow |       220        |    1,200    | Car
   2 |   4    | Red    |       180        |      950    | Car
   3 |   2    | Blue   |       260        |      230    | Bike
   4 |   2    | Red    |       210        |      320    | Bike
   5 |   4    | Yellow |       160        |      870    | Car
   6 |   4    | Blue   |       170        |      750    | Car
   7 |   4    | Red    |       190        |      850    | Car
   8 |   2    | Yellow |       140        |      140    | Bike
   9 |   2    | Yellow |       210        |      310    | Bike
  10 |   2    | Red    |       240        |      280    | Bike

• A new data point [4, Yellow, 260, 1100] → Car?
• We learned a model to classify a new (unseen) vehicle as either a car or a bike
34
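The workshop's hands-on tool is Weka, but the idea can be sketched in a few lines of Python. This toy 1-nearest-neighbour classifier (my own illustrative choice of model, not one used in the workshop) learns from the labelled vehicle table; the categorical colour feature is omitted for simplicity:

```python
def nearest_neighbour(train, labels, query):
    """Classify `query` with the label of its closest training point (1-NN)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    best = min(range(len(train)), key=lambda i: dist(train[i], query))
    return labels[best]

# Numeric features only: wheels, top speed (km/h), weight (kg)
train = [
    [4, 220, 1200], [4, 180, 950], [2, 260, 230], [2, 210, 320], [4, 160, 870],
    [4, 170, 750],  [4, 190, 850], [2, 140, 140], [2, 210, 310], [2, 240, 280],
]
labels = ["Car", "Car", "Bike", "Bike", "Car", "Car", "Car", "Bike", "Bike", "Bike"]

print(nearest_neighbour(train, labels, [4, 260, 1100]))  # -> Car
```

With such different feature scales, the weight dominates the distance; feature scaling (standardisation) would usually be applied first.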
35. Supervised learning: regression
• Each data point is provided with a continuous numerical label (outcome variable)
• The goal is to predict the outcome value for a new data point

  ID | Wheels | Colour | Top speed (km/h) | Weight (kg) | Label
   1 |   4    | Yellow |       220        |    1,200    | 120,000
   2 |   4    | Red    |       180        |      950    |  40,000
   3 |   2    | Blue   |       260        |      230    |  63,000
   4 |   2    | Red    |       210        |      320    |  53,000
   5 |   4    | Yellow |       160        |      870    |  21,000
   6 |   4    | Blue   |       170        |      750    |  37,000
   7 |   4    | Red    |       190        |      850    |  21,000
   8 |   2    | Yellow |       140        |      140    |  26,000
   9 |   2    | Yellow |       210        |      310    |  68,000
  10 |   2    | Red    |       240        |      280    |  75,000

• A new data point [4, Yellow, 260, 1100] → 140,000?
• We learned a model to predict the price of a new (unseen) vehicle
35
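A regression sketch on the same toy data, hedged accordingly: this fits ordinary least squares on a single feature (top speed only, my simplification), not the full model the slide implies:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: y ~ intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

top_speeds = [220, 180, 260, 210, 160, 170, 190, 140, 210, 240]
prices = [120000, 40000, 63000, 53000, 21000, 37000, 21000, 26000, 68000, 75000]

intercept, slope = fit_line(top_speeds, prices)
# Predicted price for a 260 km/h vehicle
print(round(intercept + slope * 260))
```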
37. Unsupervised learning: clustering
• We want the algorithm to group data points into several groups based on their similarity
• No labels are given: each data point’s group is initially unknown (?)

  ID | Wheels | Colour | Top speed (km/h) | Weight (kg) | Group
   1 |   4    | Yellow |       220        |    1,200    |   1
   2 |   4    | Red    |       180        |      950    |   1
   3 |   2    | Blue   |       260        |      230    |   2
   4 |   2    | Red    |       210        |      320    |   2
   5 |   4    | Yellow |       160        |      870    |   1
   6 |   4    | Blue   |       170        |      750    |   1
   7 |   4    | Red    |       190        |      850    |   1
   8 |   2    | Yellow |       140        |      140    |   2
   9 |   2    | Yellow |       210        |      310    |   2
  10 |   2    | Red    |       240        |      280    |   2

• A new data point [4, Yellow, 260, 1100] → group 1?
• Interpretation of group meaning is up to the researcher: 1 = ?, 2 = ?
37
38. Unsupervised learning: clustering
• We want the algorithm to group data points into several groups based on their similarity
• A different, equally valid grouping of the same data:

  ID | Wheels | Colour | Top speed (km/h) | Weight (kg) | Group
   1 |   4    | Yellow |       220        |    1,200    |   2
   2 |   4    | Red    |       180        |      950    |   1
   3 |   2    | Blue   |       260        |      230    |   2
   4 |   2    | Red    |       210        |      320    |   2
   5 |   4    | Yellow |       160        |      870    |   1
   6 |   4    | Blue   |       170        |      750    |   1
   7 |   4    | Red    |       190        |      850    |   1
   8 |   2    | Yellow |       140        |      140    |   1
   9 |   2    | Yellow |       210        |      310    |   2
  10 |   2    | Red    |       240        |      280    |   2

• A new data point [4, Yellow, 260, 1100] → group 2?
• Pick the grouping of the data that is most useful for your own purpose
38
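Both alternative groupings of the ten vehicles can be reproduced with a tiny 1-D 2-means sketch (illustrative Python, mine; seeded deterministically at the min and max values so the result is reproducible). Clustering on weight yields a heavy/light grouping; clustering on top speed yields a different fast/slow grouping of the very same data:

```python
def two_means_1d(values, iters=20):
    """Minimal 1-D 2-means. Seeds at min and max make it deterministic;
    assumes at least two distinct values."""
    c1, c2 = min(values), max(values)
    for _ in range(iters):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return [1 if abs(v - c1) <= abs(v - c2) else 2 for v in values]

weights    = [1200, 950, 230, 320, 870, 750, 850, 140, 310, 280]
top_speeds = [220, 180, 260, 210, 160, 170, 190, 140, 210, 240]

print(two_means_1d(weights))     # heavy vs. light vehicles
print(two_means_1d(top_speeds))  # a different, equally valid grouping: fast vs. slow
```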
43. Social Sciences
• Improve understanding of a domain
• Compress and summarize large datasets
• Within Learning Analytics:
• Profile learners based on their course engagement,
• Discover emerging topics in a corpus (student discussions, course materials)
• Group courses based on their characteristics
43
50. What is not clustering?
• Simple data partitioning:
  • Single property
  • Predefined groups
• Data clustering:
  • Multiple properties
  • Unforeseen groups
  • Combinations of properties describe groups
50
58. Representing a cluster
• Centroid – the geometrical centre of a cluster
• Medoid – the data point closest to the centroid
58
59. What is meant by similar?
• What is meant by “similar data points”?
• Geometry – more similar data points are closer to each other in N-dimensional feature space
• Yes, but:
  • Closeness to the cluster “centre”?
  • Closeness to any other data point in the cluster?
  • Is it about the distance between data points, or their spatial density?
59
62. Different types of clustering methods
• Membership strictness
• Hard clustering
• Each object either belongs to a cluster
or not
• Soft (fuzzy) clustering
• Each object belongs to each cluster to
some degree
62
63. Different types of clustering methods
• Membership exclusivity
• Strict partitioning clustering (e.g. K-means)
• Each object belongs to one and only one
cluster
• Strict partitioning clustering with outliers
• Each object belongs to zero or one cluster
63
64. Different types of clustering methods
• Overlapping clustering
• Each object can belong to one or more
“hard” clusters
• Hierarchical clustering
• Objects belonging to a child cluster also
belong to the parent cluster
64
65. Different types of clustering methods
• Distance-based clustering
• Group objects based on distance
among them
• Density-based clustering
• Group objects based on area they
occupy
65
66. Special clustering approaches
MAAANY more approaches
• Model-based clustering:
• EM clustering
• Neural network approaches – Self-organising maps
• Grid-based approaches (e.g., STING)
• Clustering algorithms for large datasets
• Clustering of stream data in real time
• Clustering (partitioning) approaches for different types of data (e.g., graphs)
• Clustering approaches for categorical data
• Clustering approaches for freeform clusters (e.g., CURE)
• Clustering approaches for high-dimensional data (e.g., CLIQUE, PROCLUS)
• Constraint-based clustering
• Semi-supervised clustering
66
67. Multivariate methods
• N data points have M features
• Find K clusters so that
• Each data point is associated to
each of the K clusters to a certain
degree (0 – none, 1.0 – fully)
• Each of the K clusters is
associated with all M features to
a certain degree
• Find K which maximizes the
likelihood of the observed data
67
68. Neural network approaches
• Network of connected nodes that propagate signals
• Edges have coefficients that alter signal propagation
• Traditionally a supervised learning method
• Backpropagation method of learning coefficients
• Learning method and network structure altered to
support unsupervised learning
• Nodes can move!
• Eventually, the positions of the nodes indicate the locations of the clusters
68
70. Popular distance metrics
• A way of calculating similarity between different data points
• Important for methods based on distances (e.g., K-means, hierarchical clustering)
• Has a significant effect on the final clustering results

  Distance metric              | Formula
  Euclidean distance           | $\|a - b\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$
  Squared Euclidean distance   | $\|a - b\|_2^2 = \sum_i (a_i - b_i)^2$
  Manhattan distance           | $\|a - b\|_1 = \sum_i |a_i - b_i|$ (equals the Hamming distance for binary vectors)
  Maximum distance             | $\|a - b\|_\infty = \max_i |a_i - b_i|$
70
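The four metrics are one-liners in Python (an illustrative sketch, function names mine):

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def maximum(a, b):  # also known as the Chebyshev distance
    return max(abs(x - y) for x, y in zip(a, b))

a, b = [0, 0], [3, 4]
print(euclidean(a, b), squared_euclidean(a, b), manhattan(a, b), maximum(a, b))
# -> 5.0 25 7 4
```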
73. Some examples
Kovanović, V., Joksimović, S., Gašević, D., Owers, J., Scott, A.-M., & Woodgate, A.
(2016). Profiling MOOC course returners: How does student behaviour change
between two course enrolments? In Proceedings of the Third ACM Conference
on Learning @ Scale (L@S’16) (pp. 269–272). New York, NY, USA: ACM.
https://doi.org/10.1145/2876034.2893431
73
74. Dataset
• 28 offerings of 11 different
Coursera MOOCs at the University
of Edinburgh
• 26,025 double course enrolment
records
• 52,050 course enrolment records
• K-means clustering
• Too large for clustering methods
that use pairwise distances (e.g.,
hierarchical clustering)
  #  | Course                                                               | Offerings
  1  | Artificial Intelligence Planning                                     | 1, 2, 3
  2  | Animal Behavior and Welfare                                          | 1, 2
  3  | AstroTech: The Science and Technology behind Astronomical Discovery  | 1, 2
  4  | Astrobiology and the Search for Extraterrestrial Life                | 1, 2
  5  | The Clinical Psychology of Children and Young People                 | 1, 2
  6  | Critical Thinking in Global Challenges                               | 1, 2, 3
  7  | E-learning and Digital Cultures                                      | 1, 2, 3
  8  | EDIVET: Do you have what it takes to be a veterinarian?              | 1, 2
  9  | Equine Nutrition                                                     | 1, 2, 3
  10 | Introduction to Philosophy                                           | 1, 2, 3, 4
  11 | Warhol                                                               | 1, 2
74
75. Extracted features
• 9 different features extracted:

  Feature   | Description
  Days      | No. of days active
  Sub.      | No. of submitted assignments
  Wiki      | No. of wiki page views
  Disc.     | No. of discussion views
  Posts     | No. of discussion messages written
  Quiz.     | No. of quizzes attempted
  Quiz.Uni. | No. of different quizzes attempted
  Vid.Uni.  | No. of different videos watched
  Vid.      | No. of videos watched
75
77. Results
  Cluster label         | Students | %
  Enrol only (E)        |  22,932  | 44.1
  Low engagement (LE)   |  21,776  | 41.8
  Videos & Quizzes (VQ) |   2,120  |  4.1
  Videos (V)            |   5,128  |  9.9
  Social (S)            |      94  |  0.2
77
78. Some examples
Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015).
Analytics of communities of inquiry: Effects of learning technology use on
cognitive presence in asynchronous online discussions. The Internet and Higher
Education, 27, 74–89. https://doi.org/10.1016/j.iheduc.2015.06.002
78
79. Clustering features

  #  | Type       | Code | Name                | Description
  1  | Content    | ULC  | UserLoginCount      | Total number of times the student logged into the system
  2  | Content    | CVC  | CourseViewCount     | Total number of times the student viewed general course information
  3  | Content    | AVT  | AssignmentViewTime  | Total time spent on all course assignments
  4  | Content    | AVC  | AssignmentViewCount | Total number of times the student opened one of the course assignments
  5  | Content    | RVT  | ResourceViewTime    | Total time spent reading the course resources
  6  | Content    | RVC  | ResourceViewCount   | Total number of times the student opened one of the course resource materials
  7  | Discussion | FSC  | ForumSearchCount    | Total number of times the student used the search function on the discussion boards
  8  | Discussion | DVT  | DiscussionViewTime  | Total time spent viewing the course’s online discussions
  9  | Discussion | DVC  | DiscussionViewCount | Total number of times the student opened one of the course’s online discussions
  10 | Discussion | APT  | AddPostTime         | Total time spent posting discussion board messages
  11 | Discussion | APC  | AddPostCount        | Total number of discussion board messages posted by the student
  12 | Discussion | UPT  | UpdatePostTime      | Total time spent updating the student’s own discussion board messages
  13 | Discussion | UPC  | UpdatePostCount     | Total number of times the student updated one of their own discussion board messages
79
81. Cluster interpretations

  # | Size | Label                            | Description
  1 |  21  | Task-focused users               | Overall below-average activity; above-average message-posting activity
  2 |  15  | Content-focused users            | Below-average discussion-related activity; average content-related activity, with an emphasis on assignments
  3 |  22  | No-users                         | Overall below-average activity, slightly higher in discussion-related activities
  4 |   3  | Highly intensive users           | By far the most active students, especially in content-related activities
  5 |   6  | Content-focused intensive users  | Above-average content-related activity; average discussion-related activity
  6 |  14  | Socially-focused intensive users | Above-average discussion-related activity; average content-related activity
81
82. Some examples
Almeda, M. V., Scupelli, P., Baker, R. S., Weber, M., & Fisher, A. (2014). Clustering
of Design Decisions in Classroom Visual Displays. In Proceedings of the Fourth
International Conference on Learning Analytics and Knowledge (pp. 44–48). New
York, NY, USA: ACM. https://doi.org/10.1145/2567574.2567605
82
83. Clustering visual designs of classrooms
• 30 schools in northwestern USA
• Classroom Wall Coding System (CWaCS 1.0)
• Each classroom wall was photographed
• Units of analysis were marked with a box
• Coding scheme:
  1. Academic
     1. Academic topics (F1)
     2. Academic organizational (F2)
  2. Non-academic (F3)
  3. Behavioural (F4)
• Adopted K-means to cluster classrooms based on the frequency of the four features (F1–F4)
• Academic topics: behavior, content specific, procedures, resources, calendars/clocks, other
• Academic organizational: goals for the day, group assignments, job charts, labels, schedule (day/week), yearly schedule, skills, homework
• Non-academic: motivational slogans, decorations, decorative frames, student art, other non-academic
• Behavior materials: behavior management, progress charts, rules, other behaviour
83
85. Some examples
Ferguson, R., Clow, D., Beale, R., Cooper, A. J., Morris, N., Bayne, S., & Woodgate,
A. (2015). Moving Through MOOCS: Pedagogy, Learning Design and Patterns of
Engagement. In Design for Teaching and Learning in a Networked World (pp. 70–
84). Springer International Publishing. https://doi.org/10.1007/978-3-319-
24258-3_6
85
86. Features
• For each course week, we assigned learners an activity score (a sum of):
  • 1 if they viewed content
  • 2 if they posted a comment
  • 4 if they submitted their assessment in a subsequent week (late)
  • 8 if they submitted it early or on time
• Adopted K-means
• Possible combinations:
  • 1 = Visited content only
  • 2 = Posted comment but visited no new content
  • 3 = Visited content and posted comment
  • 4 = Submitted the assessment late
  • 5 = Visited content and submitted assessment late
  • 6 = Posted comment and late assessment, saw no new content
  • 7 = Visited content, posted, late assessment
  • 8 = Submitted assessment early / on time
  • 9 = Visited content, assessment early / on time
  • 10 = Posted, assessment early / on time, no new content
  • 11 = Visited, posted, assessment early / on time
86
88. Further examples
Lust, G., Elen, J., & Clarebout, G. (2013). Regulation of tool-use within a blended course: Student differences and performance
effects. Computers & Education, 60(1), 385–395. https://doi.org/10.1016/j.compedu.2012.09.001
Wise, A. F., Speer, J., Marbouti, F., & Hsiao, Y.-T. (2013). Broadening the notion of participation in online discussions: examining
patterns in learners’ online listening behaviors. Instructional Science, 41(2), 323–343. https://doi.org/10.1007/s11251-012-9230-9
Niemann, K., Schmitz, H.-C., Kirschenmann, U., Wolpers, M., Schmidt, A., & Krones, T. (2012). Clustering by Usage: Higher Order Co-
occurrences of Learning Objects. In Proceedings of the 2Nd International Conference on Learning Analytics and Knowledge (pp.
238–247). New York, NY, USA: ACM. https://doi.org/10.1145/2330601.2330659
Cobo, G., García-Solórzano, D., Morán, J. A., Santamaría, E., Monzo, C., & Melenchón, J. (2012). Using Agglomerative Hierarchical
Clustering to Model Learner Participation Profiles in Online Discussion Forums. In Proceedings of the 2Nd International Conference
on Learning Analytics and Knowledge (pp. 248–251). New York, NY, USA: ACM. https://doi.org/10.1145/2330601.2330660
Crossley, S., Roscoe, R., & McNamara, D. S. (2014). What Is Successful Writing? An Investigation into the Multiple Ways Writers Can
Write Successful Essays. Written Communication, 31(2), 184–214. https://doi.org/10.1177/0741088314526354
Hecking, T., Ziebarth, S., & Hoppe, H. U. (2014). Analysis of Dynamic Resource Access Patterns in Online Courses. Journal of Learning
Analytics, 1(3), 34–60.
Li, N., Kidziński, Ł., Jermann, P., & Dillenbourg, P. (2015). MOOC Video Interaction Patterns: What Do They Tell Us? In Proceedings of
the 10th European Conference on Technology Enhanced Learning (pp. 197–210). Springer International Publishing.
https://doi.org/10.1007/978-3-319-24258-3_15
88
89. K-Means clustering
• The most widely used clustering algorithm
• Very simple, decent results
• Produces “circular” clusters
• Iterative algorithm
• The initial positions of the cluster centroids are random
• Often run multiple times and the results aggregated (e.g., 1,000 runs)
89
90. K-Means algorithm
1. Pick the number of clusters K
2. Pick K centroids in the N-dimensional feature space: $c_i \in \mathbb{R}^N,\ i = 1 \ldots K$
3. For each of the P data points $p_i \in \mathbb{R}^N$:
   1. Calculate the distance to each of the K centroids
   2. Assign it to its closest centroid
4. Recalculate the centroid positions based on the assigned data points
5. Repeat steps 3–4 until the centroid positions stabilise (i.e., there is no change in step 4)
90
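The steps above can be sketched in pure Python (a toy illustration, not the workshop's Weka tooling; the function name and demo points are mine):

```python
import random

def k_means(points, k, iters=100, seed=0):
    """Minimal K-means: random initial centroids (step 2), then alternate
    assignment (step 3) and centroid update (step 4) until stable (step 5)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 3: assign each point to its closest centroid (squared distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
            clusters[d.index(min(d))].append(p)
        # Step 4: recompute each centroid as the mean of its assigned points
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:   # step 5: stop when nothing changed
            break
        centroids = new
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, clusters = k_means(pts, 2)
print(sorted(len(c) for c in clusters))  # -> [3, 3]: the two obvious blobs
```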
92. K-Means characteristics
• The final solution depends a lot on the original random centroid positions
• The algorithm is often repeated (restarted) many times.
• Restart K-means R (e.g., 1,000) times.
• For each of the data points there will be R cluster assignments.
• For each data point, pick the cluster assignment which was most common among
R assignments
92
93. K-Means characteristics
• The algorithm is easy to implement
• Pretty fast, converges very quickly
• For N data points, each iteration requires calculating N × K distances (linear in N)
• Produces circular clusters – can be a problem in some domains
• Susceptible to outliers: each data point is assigned to one of the centroids and can shift its centroid significantly “off side”
• The number of clusters must be provided
• Can get stuck in a local optimum (often mitigated by multiple runs)
93
94. K-Means variants
• K-Means++
• “Smart” picking of the initial centroids (a.k.a. seeds)
• Seed selection algorithm:
  • Pick the first seed uniformly at random among the data points
  • Pick each next seed with probability proportional to its squared distance from the closest already-chosen seed
• Effectively “spreads” the seed centroids across the feature space
• K-Medoids & its flavours (Partitioning Around Medoids – PAM)
  • A solution to the outlier problem: instead of a centroid, use a medoid
  • Instead of representing clusters with centres, use existing data points to represent clusters
94
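The K-means++ seeding rule can be sketched directly (illustrative Python, mine; the demo counts how often the two seeds land in different corners of a two-blob dataset):

```python
import random

def kmeans_pp_seeds(points, k, rng):
    """K-means++ seeding sketch: first seed uniform among the data points;
    each next seed drawn with probability proportional to its squared
    distance from the closest already-chosen seed."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Squared distance of every point to its closest seed (0 for seeds)
        d2 = [min(sum((x - s) ** 2 for x, s in zip(p, seed)) for seed in seeds)
              for p in points]
        r = rng.random() * sum(d2)   # threshold in [0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc > r:              # first point whose weight crosses the threshold
                seeds.append(p)
                break
    return seeds

pts = [(0.0, 0.0), (0.2, 0.1), (10.0, 10.0), (10.1, 9.9)]
far = 0
for trial in range(100):
    s = kmeans_pp_seeds(pts, 2, random.Random(trial))
    far += sum((a - b) ** 2 for a, b in zip(s[0], s[1])) > 50
print(far)  # almost always 100: the seeds spread across the space
```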
95. PAM algorithm (Partitioning Around Medoids)
• One variant of K-Medoids
1. Pick the number of clusters K
2. Pick K data points in the N-dimensional feature space, $m_i \in \mathbb{R}^N,\ i = 1 \ldots K$, as the initial cluster representatives (medoids)
3. Assign each of the remaining data points $p_i$ to its closest representative
4. For each representative point $o_j$:
   1. Pick a random non-representative data point $o_{random}$ from its cluster
   2. Check whether swapping $o_j$ with $o_{random}$ produces clusters with a smaller “error” (the sum, over all clusters, of the absolute differences between their data points and representatives)
   3. If the new cost is smaller than the original cost, keep $o_{random}$ as the representative point
5. Repeat steps 3–4 until there are no changes in the representative objects
95
96. K-Means variants
• X-Means
  • Does not require the number of clusters K to be specified
  • Refines the clustering solution by splitting existing clusters
  • Keeps the clustering configuration that maximises the AIC (Akaike information criterion) or BIC (Bayesian information criterion)
  • Implemented in WEKA
• Cascading K-Means
  • Restarts K-means with different K and picks the K that maximises the Calinski–Harabasz criterion (the F value in ANOVA)
  • Implemented in WEKA
96
97. K-Means variants
• Large datasets variants: CLARA (Clustering LARge Applications) and CLARANS (Clustering Large
Applications upon RANdomized Search)
• CLARA: Use a sample of data points as potential candidate medoids and run PAM.
  • CLARANS: adds randomisation so the sample is not fixed at the start
• Fuzzy C-means
  • Each data point belongs to every cluster with a different degree of membership (the memberships sum to 100% across clusters)
• Also assesses the compactness of each cluster
• Compact clusters will have members with high probabilities
97
99. Hierarchical clustering
• Next to k-means, very popular method for cluster analysis
• Two key flavours
• agglomerative
• divisive
• Especially usable for small datasets
• Evaluate and pick the number of clusters visually
• The height of the merge/split indicates the distance
• Used extensively in Learning Analytics
• Many variants, using Linkage Functions
99
100. Agglomerative hierarchical clustering
• Build the clusters from bottom-up
• Algorithm:
• Build a singleton cluster for each data point
• Repeat until all data in a single cluster:
• find two closest clusters (based on linkage function)
• merge these two together
• Run Interactive DEMO
100
101. Agglomerative hierarchical clustering
• Requires calculation of the distances between all cluster pairs
• At step 1 – this means calculating all pairwise distances among data points
• N data points – about N²/2 pairwise distances
• Not feasible for large datasets
101
102. Divisive hierarchical clustering
• All data start in a single cluster, then we split one cluster at each step.
• More complex than agglomerative (how to split a cluster?)
• Less popular than agglomerative algorithms
• Can be faster as we do not need to go all the way to the bottom of the dendrogram
• Many approaches, often use “flat” algorithm as a partitioning method (e.g., K-means)
102
103. Example divisive clustering with K-means
• Start with all data in a single cluster
• Use K-means to create two initial clusters A2 and B2
• Use K-means to divide A2 into A2-1 and A2-2
• Use K-means to divide B2 into B2-1 and B2-2
• Pick between:
• A2-1, A2-2, B2
• A2, B2-1, B2-2
• Call the best combination A3, B3, and C3
• Repeat the division of each cluster into two clusters. Pick between:
• A3-1, A3-2, B3, C3
• A3, B3-1, B3-2, C3
• A3, B3, C3-1, C3-2
103
104. Linkage functions
• Key question for agglomerative clustering: how to pick the two clusters to merge
• What is meant by “closest”?
• Several different criteria. Most popular:
  • Single-linkage: minimal distance between any two data points (one from each cluster)
  • Complete-linkage: maximal distance between any two data points (one from each cluster)
  • Average-linkage: average distance between all pairs of data points (one from each cluster)
  • Ward’s method: pick the pair of clusters so that the merged cluster has the minimal possible sum of squared distances; minimises the variation within the clusters
104
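The agglomerative algorithm with a pluggable linkage function can be sketched in pure Python (an illustrative toy, mine; Ward's method is omitted for brevity):

```python
def agglomerative(points, linkage=min, stop_at=2):
    """Bottom-up clustering sketch: start with singleton clusters, repeatedly
    merge the two closest clusters until `stop_at` remain.
    linkage=min gives single-linkage, linkage=max gives complete-linkage."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    clusters = [[p] for p in points]
    while len(clusters) > stop_at:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Linkage: reduce all cross-cluster pairwise distances to one number
                d = linkage(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(sorted(len(c) for c in agglomerative(pts)))  # -> [2, 3]
```

Passing `linkage=max` instead switches the merge criterion to complete-linkage without touching the rest of the algorithm.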
109. What is Weka?
Software “workbench”
Waikato Environment for Knowledge Analysis
(WEKA)
109
110. Installing Weka
• https://www.cs.waikato.ac.nz/ml/weka/index.html
• Very powerful, lots of resources available
• Good for fast prototyping, much faster than R/Python
• Can be used
• Through GUI, which is very quirky and has hidden “gems”
• From command line (useful for integrating with other tools/scripts)
• As a Java library
• Not the best designed UI, clearly done by the developers
• Great book about ML/DM/Weka
https://www.cs.waikato.ac.nz/ml/weka/book.html
• Many demo datasets included in Weka
https://www.cs.waikato.ac.nz/ml/weka/datasets.html
110
111. Weka Interfaces
• Explorer – will be used throughout the course
• Experimenter – performance comparisons
• KnowledgeFlow – graphical front end, an alternative to the Explorer
• Workbench – unified interface
• Simple CLI – command-line interface (useful for integrating with other tools/scripts)
111
117. Selecting the number of clusters
• Clustering is user-centric and subjective
• How to pick the number of clusters?
  • Based on background knowledge (e.g., educational theory)
  • Use an algorithm that calculates the optimal number of clusters automatically (e.g., EM)
  • Use an algorithm that provides a visual overview of all clustering configurations (e.g., hierarchical clustering)
  • Use a supervised clustering algorithm, where the clustering process is guided by the user
  • Evaluate multiple values of K manually
    • “Elbow” method: find the trade-off point between the number of clusters and the within-cluster variance
    • Silhouette method: test the robustness of cluster membership
117
118. Elbow method
• As K increases, the average diameter (within-cluster variance) of the clusters also gets smaller
• Find the “sweet spot” at which the decrease in variance sharply changes
• Sometimes not so clear
118
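The elbow can be made concrete with a small sketch (illustrative Python, mine): within-cluster sum of squares (WSS) for K = 1..4 on data with two obvious blobs drops sharply at K = 2 and only marginally afterwards. A deterministic farthest-point seeding replaces random initialisation so the numbers are reproducible:

```python
def wss(points, k, iters=50):
    """Within-cluster sum of squares after a small deterministic K-means
    (farthest-point initial centroids instead of random ones)."""
    def d2(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))
    cents = [points[0]]
    while len(cents) < k:                      # farthest-point seeding
        cents.append(max(points, key=lambda p: min(d2(p, c) for c in cents)))
    for _ in range(iters):
        groups = [[] for _ in cents]
        for p in points:
            groups[min(range(k), key=lambda i: d2(p, cents[i]))].append(p)
        cents = [tuple(sum(x) / len(g) for x in zip(*g)) if g else cents[i]
                 for i, g in enumerate(groups)]
    return sum(d2(p, cents[min(range(k), key=lambda i: d2(p, cents[i]))])
               for p in points)

# Two well-separated blobs: the "elbow" should appear at K = 2
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (11, 11)]
for k in (1, 2, 3, 4):
    print(k, round(wss(pts, k), 2))
```

The WSS always decreases as K grows; the elbow is the point where the additional clusters stop paying for themselves.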
119. Silhouette method
• Visual method for determining the number of clusters
• $a(i)$ – the average distance of point $i$ to all other points in its cluster
• $b(i)$ – the smallest average distance of point $i$ to the points of another cluster (distance to the closest neighbouring cluster)
• $s(i) = \dfrac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$
• Equivalently:
  $s(i) = \begin{cases} \dfrac{b(i) - a(i)}{b(i)}, & \text{if } b(i) > a(i) \\[4pt] 0, & \text{if } b(i) = a(i) \\[4pt] \dfrac{b(i) - a(i)}{a(i)}, & \text{if } b(i) < a(i) \end{cases}$
119
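The formula translates directly into code (illustrative Python, mine; every cluster is assumed to have at least two points so that $a(i)$ is defined). A good split yields silhouette values near 1, a scrambled labelling yields values near or below 0:

```python
def silhouette(points, labels):
    """Silhouette s(i) for every point: s(i) = (b - a) / max(a, b), where
    a = mean distance to own cluster, b = mean distance to closest other cluster.
    Assumes every cluster has at least two points."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    out = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == lab)
            / sum(1 for j in range(len(points)) if labels[j] == lab)
            for lab in set(labels) if lab != labels[i]
        )
        out.append((b - a) / max(a, b))
    return out

pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
scores = silhouette(pts, labels)
print(round(sum(scores) / len(scores), 3))  # high average: a good 2-cluster split
```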
128. Challenges and solutions
• High dimensionality & feature (attribute) selection
• Categorical attributes
• “Weirdly-shaped” clusters
• Outliers
128
129. Curse of dimensionality
• Euclidean distance metric: $\sqrt{\sum_i (a_i - b_i)^2}$
• In a high-dimensional space with $d$ dimensions (each feature in 0.0–1.0):
  • It is highly likely that, for at least one feature $i$, the value $|a_i - b_i|$ will be close to 1.0
  • This puts a lower bound on the distance at around 1.0
  • The upper bound, however, is $\sqrt{d}$, with most pairs of data points being far from it
  • Most pairs of data points have a distance close to the average distance
• Many dimensions are irrelevant for most clusters
• The noise in the irrelevant dimensions masks the real differences between the clusters
129
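The concentration of distances is easy to observe empirically (illustrative Python, mine): for random points in the unit hypercube, the spread of pairwise distances relative to their mean collapses as the dimensionality grows:

```python
import random

def relative_spread(dim, n=50, seed=1):
    """Pairwise Euclidean distances between n random points in [0,1]^dim;
    returns (max - min) / mean. As dim grows this ratio shrinks:
    the distances concentrate around their average."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    dists = [sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])) ** 0.5
             for i in range(n) for j in range(i + 1, n)]
    mean = sum(dists) / len(dists)
    return (max(dists) - min(dists)) / mean

# Low-dimensional distances are spread out; high-dimensional ones are not
print(round(relative_spread(2), 2), round(relative_spread(500), 2))
```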
130. Curse of dimensionality: some solutions
• Feature transformation methods (essentially compression)
  • Create a smaller number of new, synthetic features from the larger number of input features, which are then used for clustering
  • Principal component analysis (PCA)
  • Singular value decomposition (SVD)
• Feature (attribute) selection methods
  • Search for a subset of features that are relevant for a given domain
  • Minimise entropy: the idea is that feature spaces containing tight clusters have low entropy
• Subspace clustering – extension of attribute selection
130
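For the 2-D case, PCA has a closed form: the first principal component is the eigenvector of the 2×2 covariance matrix with the largest eigenvalue. A self-contained sketch (illustrative Python, mine) projects strongly correlated 2-D data onto one synthetic feature that captures almost all of the variance:

```python
import math

def pca_1d(points):
    """Project 2-D points onto their first principal component: the eigenvector
    of the 2x2 covariance matrix with the largest eigenvalue (closed form)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]]
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # A corresponding eigenvector, normalised to unit length
    if abs(sxy) > 1e-12:
        vx, vy = lam - syy, sxy
    else:  # diagonal covariance: pick the axis with the larger variance
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    return [(p[0] - mx) * vx + (p[1] - my) * vy for p in points]

# Strongly correlated 2-D data: one synthetic feature captures nearly everything
pts = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.2), (4, 4.0)]
proj = pca_1d(pts)
var_pc1 = sum(z * z for z in proj) / len(proj)       # projections are centred
mx = sum(p[0] for p in pts) / len(pts)
my = sum(p[1] for p in pts) / len(pts)
total = sum((p[0] - mx) ** 2 + (p[1] - my) ** 2 for p in pts) / len(pts)
print(round(var_pc1 / total, 3))  # share of variance kept: close to 1.0
```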
131. Curse of dimensionality: some solutions
• Popular algorithms:
• CLIQUE: A Dimension-Growth Subspace Clustering Method
• Start with a single dimension and grow space by adding new dimensions
• PROCLUS: A Dimension-Reduction Subspace Clustering Method
  • Starts with the complete high-dimensional space and assigns a weight to each dimension for every cluster; the weights are then used to regenerate the clusters
  • Explores dense subspace regions
131
132. Categorical data
• Most clustering algorithms focus on clustering with continuous numerical attributes (ratio
variables)
• How to cluster categorical data? E.g., clustering students based on their demographic
characteristics:
• Gender
• Program
• Study level (postgraduate vs. undergraduate)
• Domestic/international
132
133. Categorical data: simple solution
• Ignore the problem, threat categorical data as numerical:
• Male: 1, Female: 2
• Domestic: 1, International: 2
• Often does not produce good results.
• Distance metric is not meaningful.
• Point A: (Male, Domestic)
• Point B: (Female, Domestic)
• Point C: (Female, International)
• Is point B closer to point A or point C?
• Depends on the information value of these two features
• “Localized method”
• If two distinct clusters have a few points that are close, they might be merged together
incorrectly.
133
[Figure: points A, B, and C plotted on a Gender (1–2) vs. Domestic/International (1–2) grid]
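The slide's A/B/C example can be checked directly. With the naive coding Male = 1 / Female = 2 and Domestic = 1 / International = 2, Euclidean distance makes B exactly equidistant from A and C, regardless of how informative each feature actually is:

```python
import numpy as np

A = np.array([1, 1])   # (Male, Domestic)
B = np.array([2, 1])   # (Female, Domestic)
C = np.array([2, 2])   # (Female, International)

d_AB = np.linalg.norm(A - B)
d_BC = np.linalg.norm(B - C)
print(d_AB, d_BC)      # both 1.0: the metric cannot tell whether
                       # gender or residency carries more information
```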
134. Categorical data: custom algorithms
• ROCK (RObust Clustering using linKs)
• A Hierarchical Clustering Algorithm for Categorical Attributes
• Two data points are similar if they have similar neighbours
• Typical example: market basket data
134
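A rough sketch of ROCK's core idea (toy data, not the full algorithm): compute Jaccard similarity on market-basket rows, declare points neighbours above a threshold θ (value assumed here), and count "links" as shared neighbours. Note that two points can be linked even when their direct similarity is below θ.

```python
import numpy as np

# Toy market-basket data: rows = customers, columns = items bought.
baskets = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=bool)

def jaccard(a, b):
    return (a & b).sum() / (a | b).sum()

n = len(baskets)
sim = np.array([[jaccard(baskets[i], baskets[j]) for j in range(n)]
                for i in range(n)])

theta = 0.4                      # neighbour threshold (assumed value)
neighbours = sim >= theta        # includes each point itself

# links[i, j] = number of neighbours that points i and j share
links = neighbours.astype(int) @ neighbours.astype(int).T
print(links)
```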
135. “Weirdly-shaped” clusters
• Most algorithms focus on distance
between data points
• However, often the connectedness of
data points is also important
• Different algorithms developed for
these situations
135
136. Different types of clustering methods
• Distance-based clustering
• Group objects based on distance
among them
• Density-based clustering
• Group objects based on area they
occupy
136
137. CURE
• Pick a subsample of the data and cluster it using a
method such as hierarchical clustering
• Pick the N characteristic points per cluster that
are most distant from each other.
• Move the representative points a fraction of the
way towards the cluster centroid.
• Merge two clusters whose representative
points are sufficiently close.
137
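The representative-point step of CURE can be sketched as follows (toy data; the shrink fraction and the farthest-first selection are illustrative choices, not CURE's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
cluster = rng.normal(size=(50, 2))          # points of one cluster
centroid = cluster.mean(axis=0)

def scattered_points(points, n_rep):
    """Greedily pick n_rep well-scattered points (farthest-first)."""
    centre = points.mean(axis=0)
    # Start from the point farthest from the centroid.
    reps = [points[np.argmax(np.linalg.norm(points - centre, axis=1))]]
    while len(reps) < n_rep:
        # Next rep: the point maximising distance to its nearest chosen rep.
        dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(dists)])
    return np.array(reps)

alpha = 0.2                                  # shrink fraction (assumed)
reps = scattered_points(cluster, n_rep=4)
shrunk = reps + alpha * (centroid - reps)    # move towards the centroid
print(shrunk.shape)
```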
138. DBSCAN
• DBSCAN (Density-based spatial clustering of
applications with noise)
• Density-based algorithm
• Searches for areas with a large number of points
• Implemented in WEKA
• General idea:
• Each data point is either a core point, a reachable
point, or an outlier
• Core points have at least minP (parameter) points
around them within radius r (parameter)
• Reachable points are within radius r of a core point
• Every other data point is an outlier
138
[Figure: DBSCAN example with minP = 4; red: core points, yellow: reachable points, blue: outliers]
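A minimal DBSCAN run with scikit-learn (data and parameter values are illustrative): two tight groups plus one isolated point should come out as two clusters and one outlier, labelled -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.0, 0.2],            # group 1
    [10.0, 10.0], [10.0, 10.1], [10.1, 10.0], [10.1, 10.1], [10.0, 10.2],  # group 2
    [5.0, 5.0],                                                            # isolated point
])

# eps plays the role of radius r and min_samples of minP from the slide
# (scikit-learn counts a point inside its own neighbourhood).
labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)
print(labels)   # two clusters (0 and 1) plus one noise point (-1)
```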
139. Self-organising maps (SOM)
• Special type of neural network
• Used to learn the contour of the underlying data
• Neurons laid out in a grid structure, each neuron connected
to its neighbours and to all input nodes
• For each data point, a neuron which is closest to it gets
adjusted, with adjustments being propagated to
neighbouring neurons
• Over time, neurons will position themselves in the shape of
the data
• Dense areas with many neurons indicate clusters
139
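The update rule above can be sketched as a bare-bones SOM (toy code, not a full implementation): a 1-D chain of neurons adapts to a 2-D data cloud, with the winning neuron pulled hardest and its grid neighbours pulled more weakly.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.uniform(-1, 1, size=(500, 2))            # underlying data cloud
n_neurons = 10
weights = rng.uniform(-1, 1, size=(n_neurons, 2))   # neuron positions

lr, sigma = 0.3, 2.0            # learning rate, neighbourhood width (assumed)
for epoch in range(20):
    for x in data:
        # Winner: the neuron closest to the data point.
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Gaussian neighbourhood on the 1-D grid: neurons near the
        # winner on the chain are adjusted too, just less strongly.
        grid_dist = np.abs(np.arange(n_neurons) - winner)
        h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    lr *= 0.9                   # decay over time so the map settles
    sigma *= 0.9
print(weights.round(2))         # neurons spread over the data's shape
```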
141. Expectation-maximization (EM) clustering
• Much more general than clustering
• Used to estimate hidden (latent) parameters
• Does not require hard cluster assignments: each point gets a likelihood of belonging to each cluster
• General idea:
• Pick number of clusters K
• Fit K distributions over clustering variables with their parameters P
• Estimate the likelihood of each data point being generated by each of the K distributions (expectation)
• For every data point, sum the likelihoods across the K distributions to normalise them into weights
• Combine the weights with the data to produce new estimates for the parameters P (maximization)
• Repeat until convergence is reached (no parameter change)
141
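The E/M loop above corresponds to fitting a Gaussian mixture model. A minimal run with scikit-learn (data and K = 2 are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.concatenate([
    rng.normal(0.0, 0.5, size=200),    # cluster around 0
    rng.normal(10.0, 0.5, size=200),   # cluster around 10
]).reshape(-1, 1)

# EM runs internally until convergence of the mixture parameters.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
means = np.sort(gmm.means_.ravel())
print(means.round(2))                  # roughly 0 and 10
```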
145. Analysis of cluster differences
• We can check the differences between clusters with regards to
• Clustering variables (e.g., number of logins, number of discussion posts)
• Some additional variables (e.g., student grades, age, gender)
• We can examine differences
• One variable at a time (univariate differences)
• Across multiple variables simultaneously (multivariate differences)
• Takes into consideration the interaction among multiple variables
145
146. Univariate analysis of cluster differences
• For every variable we can use parametric and non-parametric univariate tests:
• Two clusters: t-test and Mann-Whitney
• Three or more clusters: One-way ANOVA and Kruskal-Wallis
• Requires p-value adjustment (e.g., Bonferroni, Holm-Bonferroni correction)
• Whether to use parametric or non-parametric primarily depends on the homogeneity
(equality) of variance assumption
• Can be tested with Levene’s test
• If Levene’s test shows p<.05, use Mann-Whitney and Kruskal-Wallis
• Significant ANOVA tests can be followed by pairwise tests (e.g., TukeyHSD)
• Significant Kruskal-Wallis tests can be followed by pairwise KW tests (also with p-
value correction)
146
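The decision procedure above can be sketched with SciPy (toy data; the group sizes, the variable, and the number of tests for the Bonferroni step are all assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# One clustering variable (e.g., number of logins) across three clusters;
# the third cluster has a deliberately larger variance.
logins = [rng.normal(10, 2, 40), rng.normal(12, 2, 40), rng.normal(15, 6, 40)]

# Levene's test for homogeneity of variance decides which family to use.
lev_stat, lev_p = stats.levene(*logins)
if lev_p < 0.05:
    stat, p = stats.kruskal(*logins)     # non-parametric fallback
    test = "Kruskal-Wallis"
else:
    stat, p = stats.f_oneway(*logins)    # one-way ANOVA
    test = "one-way ANOVA"

# With several variables tested, adjust p-values, e.g. Bonferroni:
n_tests = 3                      # assumed number of clustering variables
p_adjusted = min(p * n_tests, 1.0)
print(test, round(p, 4), round(p_adjusted, 4))
```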
147. Multivariate analysis of cluster differences
• We can test differences across all variables at the same time
• More holistic than ANOVA/KW
• Instead of one dependent variable, we can have multiple variables
• Use meaningful groups of variables (e.g., behavioural variables)
• MANOVA: Multivariate analysis of variance
• Step “before” ANOVAs/KWs
• Has several test statistics: Wilks’ Λ, Pillai’s trace
• Assumption: Homogeneity of covariance
• Much trickier to test: Box’s M is one method, but it is very sensitive (use p<.001)
• Use Levene’s tests on each of the variables (this doesn’t guarantee homogeneity of
covariance but might help)
• If the assumption is violated, MANOVA can still be used, but with a more robust statistic (Pillai’s trace)
147
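Pillai's trace itself is just trace(H(H + E)⁻¹), where H is the between-group and E the within-group sums-of-squares-and-cross-products (SSCP) matrix. A worked sketch with toy data (two clusters, three dependent variables):

```python
import numpy as np

rng = np.random.default_rng(6)
groups = [rng.normal(0, 1, size=(30, 3)),
          rng.normal(1, 1, size=(30, 3))]   # cluster means differ

grand_mean = np.vstack(groups).mean(axis=0)
H = np.zeros((3, 3))   # between-group SSCP
E = np.zeros((3, 3))   # within-group SSCP
for g in groups:
    m = g.mean(axis=0)
    d = (m - grand_mean)[:, None]
    H += len(g) * (d @ d.T)
    centred = g - m
    E += centred.T @ centred

pillai = np.trace(H @ np.linalg.inv(H + E))
print(round(pillai, 3))   # with 2 groups this lies between 0 and 1
```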
148. Example MANOVA
“For assessing the difference between student clusters a multivariate analysis of
variance (MANOVA) was used. To validate the difference between the discovered
clusters a MANOVA model with cluster assignment as a single independent variable and
thirteen clustering variables as the dependent measures was constructed…”
Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015). Analytics of communities of
inquiry: Effects of learning technology use on cognitive presence in asynchronous online discussions. The
Internet and Higher Education, 27, 74–89. doi:10.1016/j.iheduc.2015.06.002
148
149. Example MANOVA
“Before running MANOVAs, … the homogeneity of covariances assumption was checked using
Box’s M test and homogeneity of variances using Levene’s test. To protect from the assumption
violations, we log-transformed the data and used the Pillai’s trace statistic which is considered to
be robust against assumption violations. As a final protection measure, obtained MANOVA
results were compared with the results of the robust rank-based variation of the MANOVA
analysis”
Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015). Analytics of communities of
inquiry: Effects of learning technology use on cognitive presence in asynchronous online discussions. The
Internet and Higher Education, 27, 74–89. doi:10.1016/j.iheduc.2015.06.002
149
150. Example MANOVA
“The assumption of homogeneity of covariances was tested using Box’s M test which was not
accepted. Thus, Pillai’s trace statistic was used, as it is more robust to the assumption violations
together with the Bonferroni correction method. A statistically significant MANOVA effect was
obtained, Pillai’s Trace = 1.62, F(39, 174) = 5.28, p < 10⁻¹⁴”
Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015). Analytics of communities of
inquiry: Effects of learning technology use on cognitive presence in asynchronous online discussions. The
Internet and Higher Education, 27, 74–89. doi:10.1016/j.iheduc.2015.06.002
150
151. MANOVA Follow-up analyses
• A significant MANOVA can be followed up with two types of analyses:
• Individual ANOVAs/KWs (with p-value correction)
• Which in turn can be followed with pairwise analyses: TukeyHSD/Pairwise KWs
• Discriminant Function Analysis (DFA)
• What combinations of variables differentiate between clusters
• DFA can be run alone (without MANOVA) but its significance then can’t be tested
151
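The DFA follow-up can be sketched with scikit-learn's LinearDiscriminantAnalysis (toy data; cluster assignments and variables are assumptions). The fitted scalings are the discriminant functions: linear combinations of the variables that best separate the clusters.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
# Three well-separated clusters on four (clustering) variables.
X = np.vstack([rng.normal(0, 1, size=(40, 4)),
               rng.normal(2, 1, size=(40, 4)),
               rng.normal(4, 1, size=(40, 4))])
clusters = np.repeat([0, 1, 2], 40)          # cluster assignments

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, clusters)
# Each column of scalings_ is one discriminant function; the
# explained-variance ratios show how much separation each captures.
print(lda.scalings_.shape, lda.explained_variance_ratio_.round(3))
```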
152. 152
Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015). Analytics of communities
of inquiry: Effects of learning technology use on cognitive presence in asynchronous online
discussions. The Internet and Higher Education, 27, 74–89. doi:10.1016/j.iheduc.2015.06.002
153. Example DFA analysis
153
Kovanović, V., Gašević, D., Joksimović, S., Hatala, M., & Adesope, O. (2015). Analytics of communities
of inquiry: Effects of learning technology use on cognitive presence in asynchronous online
discussions. The Internet and Higher Education, 27, 74–89. doi:10.1016/j.iheduc.2015.06.002