Presentation delivered at Texas Association of Institutional Research on the applications of unsupervised clustering models to drive student outreach, as well as a general overview of common algorithms.
4. Reverse looking
What are the characteristics of students who used/did not use a service? Did/did not persist?
Who is the prototypical student in each academic program?
Forward looking
What characteristics comprise our prospective student personas?
How do risk factors overlap in some groups of students?
6. The challenge
Contrary to national trends, Dallas College has experienced larger-than-typical drops in female student re-enrollment. Our goal was to stem decreases for the Fall 2021 semester through a text campaign.
The question
What messaging should be used to resonate with these students?
7. 45k students
Diverse features for the model
• Demographics: race/ethnicity, gender, age
• Financial: income, employment intensity
• Household: household size, dependent count
• Academics: last term of enrollment, credits, GPA
• Special Pop membership
8. Step 1
Starts with a case as a leaf node
Each case goes into that leaf node or breaks into a new node
Result is a Cluster Features tree
Step 2
Leaf nodes are combined through agglomeration
Result is several “best” clusters with a silhouette score
16. Algorithm | Variable Types
BIRCH / 2-Step | Any, multiple at the same time
Hierarchical | Any, but stick to one kind at a time and use a well-paired distance measure
K-means | Continuous
K-modes | Categorical
K-medoids | Any, multiple at the same time with the right distance measure (Gower)
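To make the Gower measure mentioned for K-medoids concrete, here is a minimal sketch of how it could be computed for one pair of records. The feature names, values, and ranges are hypothetical and not taken from the Dallas College data; numeric features contribute a range-normalized absolute difference and categorical features contribute 0 for a match or 1 for a mismatch.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower distance between two records given as dicts.

    numeric_ranges maps each numeric feature name to its (min, max) over
    the data set; features not listed there are treated as categorical.
    """
    total = 0.0
    for name in a:
        if name in numeric_ranges:
            lo, hi = numeric_ranges[name]
            span = hi - lo
            # Range-normalized absolute difference, in [0, 1].
            total += abs(a[name] - b[name]) / span if span else 0.0
        else:
            # Simple matching: 0 if the categories agree, 1 otherwise.
            total += 0.0 if a[name] == b[name] else 1.0
    return total / len(a)


# Hypothetical students; feature names and values are illustrative only.
s1 = {"gpa": 3.0, "credits": 12, "employment": "part-time", "veteran": "no"}
s2 = {"gpa": 2.0, "credits": 6, "employment": "full-time", "veteran": "no"}
ranges = {"gpa": (0.0, 4.0), "credits": (0, 24)}

d = gower_distance(s1, s2, ranges)
print(d)
```

Because every feature is scaled to the 0 to 1 range before averaging, continuous and categorical variables can sit side by side in one distance, which is why the measure pairs well with K-medoids.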
17. Algorithm | Strengths (+) | Weaknesses (-)
BIRCH / 2-Step | Flexible variable types, fast compute, large data | There is an element of a black box
Hierarchical | Flexible modeling, highly explainable | Limited to small data
K-Means, K-Modes | Easy, fast, large data | Sensitive to K, curse of dimensionality, clusters same sized
K-medoids | Like K-Means but more flexible variable types | More costly tuning & compute
19. • Population: adults 24 years and older enrolled for Fall 2021
• Features: academics, demographics, financial, household, and veteran status
• Mix of discrete and continuous variables, with the majority being categorical
[Figure: income categories relative to the median household income in Texas: below federal poverty level, below median, lower half above median, upper half above median. Variable labels: Poverty Flag, INCOME BIN]
20. Dissimilarity measure (per variable): 0 = match, 1 = mismatch
         V1 V2 V3 V4 V5
Sample 1  0  1  1  0  1
Sample 2  1  0  0  0  1
Dissimilarity between Sample 1 and Sample 2: 1+1+1+0+0 = 3
• Randomly select the K initial centers
• Repeat
1. Assign the samples to the nearest center
2. Update the means/modes based on the newly formed clusters
3. Calculate the cost (SSE/sum of dissimilarities)
• Stop when the cluster centers converge
K-modes uses a frequency-based method to calculate the mode instead of the mean of the sample
Goal: minimize the cost function
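The loop on this slide can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the code used in the study: the records are hypothetical, the initial centers are drawn with a fixed seed, and the dissimilarity is a simple count of mismatched variables.

```python
import random
from collections import Counter


def dissimilarity(a, b):
    """Count the positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent value in each column (the frequency-based center)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # randomly select the K initial centers
    for _ in range(max_iter):
        # 1. Assign each sample to its nearest center.
        labels = [min(range(k), key=lambda j: dissimilarity(row, centers[j]))
                  for row in data]
        # 2. Update each center to the mode of its members.
        new_centers = []
        for j in range(k):
            members = [row for row, lab in zip(data, labels) if lab == j]
            new_centers.append(column_modes(members) if members else centers[j])
        # 3. Cost: total dissimilarity of samples to their assigned centers.
        cost = sum(dissimilarity(row, new_centers[lab])
                   for row, lab in zip(data, labels))
        if new_centers == centers:  # stop when the centers converge
            break
        centers = new_centers
    return labels, centers, cost


# Hypothetical records: (employment, dependents, first-generation flag).
data = [
    ("part-time", "3+", "yes"),
    ("part-time", "3+", "no"),
    ("full-time", "1-2", "no"),
    ("full-time", "1-2", "no"),
    ("part-time", "3+", "yes"),
    ("full-time", "0", "no"),
]
labels, centers, cost = k_modes(data, k=2)
print(labels, centers, cost)
```

The result depends on the random starting centers, so in practice it is common to run several seeds and keep the run with the lowest cost.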
22. Missing values for categorical variables such as employment status
Missing values for continuous variables such as income
Possible solutions
• Revisit our data warehouse to obtain as much information as we can to fill in the missing values
• Imputation with K-nearest neighbors with Hamming distance for categorical data
• Imputation with K-nearest neighbors with Euclidean distance for continuous data
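As a rough illustration of the categorical option, the sketch below fills a missing value with the majority vote of the k most similar complete records, using Hamming distance on the remaining columns. The records and column meanings are hypothetical, and the sketch assumes only one column has missing values. The continuous case would follow the same pattern with Euclidean distance and an average in place of the vote.

```python
from collections import Counter


def hamming(a, b, skip):
    """Count mismatches between two records, ignoring column `skip`."""
    return sum(x != y for i, (x, y) in enumerate(zip(a, b)) if i != skip)


def knn_impute_categorical(rows, col, k=3):
    """Fill None in column `col` with the majority value among the k
    complete rows that are closest in Hamming distance."""
    donors = [r for r in rows if r[col] is not None]
    filled = []
    for r in rows:
        if r[col] is not None:
            filled.append(tuple(r))
            continue
        nearest = sorted(donors, key=lambda d: hamming(r, d, col))[:k]
        vote = Counter(d[col] for d in nearest).most_common(1)[0][0]
        filled.append(tuple(r[:col]) + (vote,) + tuple(r[col + 1:]))
    return filled


# Hypothetical records: (income bin, dependents, employment status).
rows = [
    ("low", "3+", "part-time"),
    ("low", "3+", "part-time"),
    ("low", "1-2", "part-time"),
    ("high", "0", "full-time"),
    ("high", "0", "full-time"),
    ("low", "3+", None),  # employment status is missing
]
filled = knn_impute_categorical(rows, col=2, k=3)
print(filled[-1])
```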
23.
24.
25. Jeremy Anderson, Associate Vice Chancellor of Strategic
Analytics, jeremy.anderson@dcccd.edu
Dillon Lu, Data Analyst, DLu@dcccd.edu
Editor's Notes
Refer to the recent Dallas Morning News article and its framing.
Give brief overview – focused on 2 components – Data and Demographics by Core Populations AND Interventions to help support all students.
You have a population in n dimensions of space
Your model breaks that into subpopulations that are self-similar and distinct from other subpopulations
There are many algorithms that attack this basic task in different ways
The idea with all of these kinds of research questions is that you would take a very large population and break it down into smaller groups.
Then, you customize the messaging to fit the cluster personas.
Used a lot in the marketing world to create customer personas, but has applications to any kind of “product” like academic programs and student services.
The case count was fairly high.
45k+ female students had enrolled in prior three terms, had not attained a credential or transferred, and were not enrolled for the Fall.
Some variables like income and household size are highly missing if we rely just on FAFSA because only about half of our students complete it. Instead, we supplement by using a homegrown Student Information Profile (SIP) that goes out once a year and asks those and other questions. Part of data preparation is merging the two data streams, which is also a challenge because these variables are binned as categorical in the SIP but are continuous in FAFSA.
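One way to reconcile a continuous source with a binned one is to bin the continuous value with the same cutoffs and then fall back from one source to the other. The sketch below assumes made-up bin edges and labels, and it assumes FAFSA is preferred when both sources exist; neither detail is stated in the talk.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels; the real cutoffs (poverty line,
# Texas median income) are not given in the talk.
EDGES = [25_000, 60_000, 90_000]
LABELS = ["bin_1", "bin_2", "bin_3", "bin_4"]


def to_income_bin(income):
    """Map a continuous income to a categorical bin label.

    An income exactly on an edge falls into the higher bin.
    """
    return LABELS[bisect_right(EDGES, income)]


def merged_income_bin(fafsa_income, sip_bin):
    """Prefer FAFSA when present (an assumption), else use the SIP bin."""
    if fafsa_income is not None:
        return to_income_bin(fafsa_income)
    return sip_bin  # may still be None if both sources are missing


print(merged_income_bin(42_000, None))
print(merged_income_bin(None, "bin_3"))
```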
The other variables are highly available and mostly clean, so very little cleanup was necessary.
Other than that, we do some regular feature engineering around membership in special populations like athletes, international students, foster students, veterans, etc. That's code that we've written and reuse across projects to save some time.
After the data prep, I stepped back and saw that the features for consideration were all mixed. Some were binary and others categorical or continuous. For that reason, and for the size of the data, I chose the two-step clustering algorithm in SPSS because it works well under both conditions and because, as a social scientist by training, I am most familiar with SPSS as an analysis tool.
With the data ready, I then played around with the features to maximize the evaluation score I was seeing. I loaded them all in at once, then looked at the cluster overlap charts, which are a feature in SPSS. I took out the ones that didn’t have strong separation and was left with a collection of features that would be most useful/impactful for final modeling.
The tool I chose to use was SPSS because that's my bread and butter as a social scientist by trade. It has a couple of built-in options for clustering.
In step 1, the similarity score, based on the distance between the case and the node values, is the determining factor. If the distance is above the algorithm's threshold and the node would spread too wide to incorporate it, then the data point goes into a newly created node.
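The threshold idea in step 1 can be sketched as a single pass over the data. This is a deliberately flattened illustration: it keeps a flat list of nodes rather than a tree, uses Euclidean distance to the node centroid on numeric features, and the points and threshold are invented. SPSS's actual procedure is more involved.

```python
import math


def build_leaf_nodes(points, threshold):
    """One pass over the data; returns a list of [count, linear_sum] nodes."""
    nodes = []
    for p in points:
        best, best_dist = None, math.inf
        for node in nodes:
            n, ls = node
            centroid = [s / n for s in ls]
            dist = math.dist(p, centroid)
            if dist < best_dist:
                best, best_dist = node, dist
        if best is not None and best_dist <= threshold:
            # Close enough: absorb the case into the existing node.
            best[0] += 1
            best[1] = [s + x for s, x in zip(best[1], p)]
        else:
            # Too far from every node: start a new node for this case.
            nodes.append([1, list(p)])
    return nodes


# Hypothetical two-feature cases.
points = [(0, 0), (0.5, 0), (10, 10), (10.5, 10)]
nodes = build_leaf_nodes(points, threshold=2)
print(nodes)
```

Each node stores only a count and a running sum, so the centroid can be recovered without keeping the individual cases, which is what makes this style of first pass cheap on large data.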
In step 2, the nodes from the tree start to get clumped together, also based on their distances. Lots of different cluster counts are tried and evaluated automatically with one of the available clustering criteria that you can choose from in the model settings in SPSS.
Ultimately, you get a set of the best clusters for the data. The model picks the number of clusters automatically through the Cluster Tree creation and the evaluation of the agglomerations.
I got my silhouette measure that gave me a sense of how well the points in each cluster were self-similar and how much each cluster was different from the others.
The score ranged from 0.6 to 0.7, which falls in the good range.
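For readers who want to see what sits behind a silhouette measure, here is a minimal sketch using Euclidean distance on invented points. It assumes at least two clusters and scores singleton clusters as 0; SPSS computes its own version inside the two-step procedure, so values will not match exactly.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical points: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Values near 1 mean cases sit much closer to their own cluster than to the next nearest one, and values near or below 0 mean the clusters overlap.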
Was able to click into the scoring for details. That’s the clusters matrix that shows the clusters as columns and the features used to build the clusters as rows
Can click on one of the cells to see that feature’s overall distribution and what’s present in the cluster in the Cluster Distribution visual
Can click multiple clusters to see how they compare in a visual way in the Cluster Comparison
Hispanic, Black, or White, with 3-4 dependents, employed part-time
Hispanic, 3 dependents, employed full-time
Black or White, 1-2 dependents, employed full-time
Overriding message was POC with more dependents, especially if employed less than full-time. The hypothesis there was that we should focus messaging on emergency aid
Whereas full-time workers with fewer dependents might benefit from an awareness campaign about our childcare assistance. Really, though, we needed to test these assumptions.
Used these clusters to then create a call list of ~350 students with the aim of talking to 50 (10-15 per cluster).
The phone interviews were short, about 10 minutes, and guided by a set of template questions that our Student Success Research team and our student success staff worked on together.
Results from the qualitative discussions left us with a collection of themes for each cluster and some prospective talking points to use in the text campaign.
We ended up with 5,600 students responding to the text campaign and re-enrolling, equating to an 18% response rate.
Before we take a quick tour of some other clustering algorithms and their strengths and weaknesses, let's pause for one or two comments or questions about our first example. We will also have some time at the end for other questions.
Today, we're looking at three very common categories of clustering algorithms, though there are others that are applicable to other types of challenges.
Already covered two-step, which is similar to BIRCH, in example 1. The point there is that it blends the approach of the other types of algorithms listed to get the best of both worlds.
For comparison, then, we’ll look at these other two common approaches in the K-algorithms and hierarchical algorithm.
The premise of all three of these is that you pick K, the number of clusters. Rather than pick at random, two good ways to narrow in on the best K for the data are:
elbow method – for this it would be somewhere between 6 and 8 that you might want to try
Silhouette Coefficient
K-means and the silhouette coefficient are available in Python's scikit-learn package; K-modes and K-medoids need add-on packages such as kmodes and scikit-learn-extra. They're all in R too, but it'll take a few packages.
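The elbow method amounts to running the algorithm for several values of K and watching where the cost stops dropping sharply. The sketch below uses a bare-bones K-means written in plain Python on invented points, purely to show the mechanics; a library implementation would normally be used instead.

```python
import math
import random


def kmeans_sse(points, k, n_iter=50, seed=0):
    """Run a basic K-means and return the final sum of squared errors."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[j].append(p)
        # Recompute each center as the mean of its members; keep the old
        # center if a cluster ends up empty.
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


# Hypothetical data: three small, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]

sse = {k: kmeans_sse(points, k) for k in range(1, 6)}
for k, value in sse.items():
    print(k, round(value, 2))
```

Plotting these values against K and looking for the bend gives the candidate range to try; the result for any single K depends on the starting centers, so several seeds are usually compared.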
So, how does it work, generally?
K-means uses the mean of each cluster's points as the cluster center, which need not be an actual case, whereas K-medoids chooses the most representative actual case.
K-medoids is less sensitive to outliers, like using the median rather than the mean, so it may be preferable to K-means depending on your data
K-modes, meanwhile, looks for the number of similarities between data points when considering the different features, but it still collapses all of this into a pre-determined number of clusters that you set.
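The outlier point about means versus medoids is easy to see on a single made-up feature. In this sketch the values are hypothetical credit-hour totals for one cluster, and the medoid is taken to be the case with the smallest total absolute distance to all the others.

```python
def mean_center(values):
    """Center used by K-means: the arithmetic mean of the members."""
    return sum(values) / len(values)


def medoid(values):
    """The actual case with the smallest total distance to all others."""
    return min(values, key=lambda v: sum(abs(v - w) for w in values))


# Hypothetical credit-hour totals for one cluster, with one outlier.
credits = [12, 13, 14, 15, 90]

print(mean_center(credits))  # pulled toward the outlier
print(medoid(credits))       # stays with the bulk of the cases
```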
Match up pairs of cases based on distance between them
Then match up pairs of pairs, also based on distance
Pick what distance you’re okay with and use that to determine the clusters. That's where it's different from Two-Step and BIRCH which will do that automatically.
Divisive flips this on its head and starts with all of the cases in one cluster, then splits it apart.
Available in Python scikit-learn and R
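The agglomerative steps described above (pair up the closest cases, then the closest groups, then stop at a distance you are comfortable with) can be sketched as follows. This version assumes single linkage and Euclidean distance on invented points; other linkage rules and distance measures are equally valid choices.

```python
import math


def agglomerate(points, max_distance):
    """Single-linkage agglomerative clustering with a distance cutoff."""
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) > max_distance:
            break  # remaining clusters are farther apart than we accept
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical cases with two features each.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
result = agglomerate(points, max_distance=2)
print(result)
```

Raising `max_distance` merges more groups together, which is the manual choice that Two-Step and BIRCH make automatically.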
Divide our student body into groups so we can look at each specific group more closely and identify what they need and what can we do to help them reach their educational goals.
K-modes clustering
Determine the initial number of clusters K
Minimize the cost function while maintaining a relatively small value for K.
The purpose of the study is to divide our student body into large groups so we can study each group more effectively and gain a better understanding of our student body as a whole.
We prefer generalization rather than looking at specific cases.
Choose 4 in this case and move on to the next page
This information can help us develop strategic plans that focus on a particular group of students to help them achieve their goals and be successful here at our institution.
Many students with missing income end up in the second income bin due to the nature of our dataset design.
Revisiting our data and eliminating as many missing values as possible to obtain more complete data should help greatly in distributing our students more evenly across the INCOME_BIN variable.