Clustering Models to Assist in Student Outreach


Presentation delivered at Texas Association of Institutional Research on the applications of unsupervised clustering models to drive student outreach, as well as a general overview of common algorithms.

Data & Analytics

1. Why clustering?
2. Example 1, hierarchical clustering
3. Questions
4. Overview of common algorithms and strengths/limitations
5. Example 2, K-means clustering
6. Questions

Reverse looking
• What are the characteristics of students who used/did not use a service? Did/did not persist?
• Who is the prototypical student in each academic program?
Forward looking
• What characteristics comprise our prospective student personas?
• How do risk factors overlap in some groups of students?

The challenge
Contrary to national trends, Dallas College has experienced larger-than-typical drops in female student re-enrollment. Our goal was to stem decreases for the Fall 2021 semester through a text campaign.
The question
What messaging should be used to resonate with these students?

45k students
Diverse features for the model
• Demographics: race/ethnicity, gender, age
• Financial: income, employment intensity
• Household: household size, dependent count
• Academics: last term of enrollment, credits, GPA
• Special Pop membership

Step 1
• Starts with a case as a leaf node
• Each subsequent case goes into an existing leaf node or breaks into a new node
• Result is a Cluster Features tree
Step 2
• Leaf nodes are combined through agglomeration
• Result is several “best” clusters with a silhouette score
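The two steps above can be sketched in a few lines of Python. This is only an illustration under loud simplifying assumptions: it uses plain Euclidean distance on numeric features, keeps a flat list of leaves instead of a real Cluster Features tree, and takes a fixed cluster count rather than choosing one automatically. The function names, the `threshold` parameter, and the sample cases are hypothetical and are not taken from SPSS.

```python
import math

def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def build_leaves(cases, threshold):
    """Step 1 (simplified): put each case in the nearest leaf, or start a new leaf."""
    leaves = []  # each leaf is a list of cases
    for case in cases:
        best = min(leaves, key=lambda leaf: math.dist(case, centroid(leaf)), default=None)
        if best is not None and math.dist(case, centroid(best)) <= threshold:
            best.append(case)      # close enough: absorb into the existing leaf
        else:
            leaves.append([case])  # too far from every leaf: break into a new one
    return leaves

def agglomerate(leaves, n_clusters):
    """Step 2 (simplified): repeatedly merge the two groups with the closest centroids."""
    clusters = [list(leaf) for leaf in leaves]
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: math.dist(centroid(clusters[p[0]]),
                                                  centroid(clusters[p[1]])))
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

# Hypothetical two-feature cases, for illustration only
cases = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.3), (9.0, 0.0), (9.2, 0.1)]
leaves = build_leaves(cases, threshold=1.0)
clusters = agglomerate(leaves, n_clusters=2)
```

A real two-step or BIRCH implementation summarizes each leaf with running statistics rather than storing every case, which is what lets it scale to large data.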

• African American, 3 dependents, part-time job
• Hispanic, 4 dependents, part-time job
• Hispanic, 2 dependents, full-time job
• White, 1 dependent, full-time job

BIRCH and Two-step (SPSS proprietary; Ex 1)
K-means, K-modes, K-medoids (Ex 2)
Hierarchical (Agglomerative and Divisive)

[Figure: clusters based on the distance cutoff line]

Algorithm | Variable types
BIRCH / 2-Step | Any, multiple at the same time
Hierarchical | Any, but stick to one kind at a time and use a well-paired distance measure
K-means | Continuous
K-modes | Categorical
K-medoids | Any, multiple at the same time with the right distance measure (Gower)
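To make the Gower measure named for K-medoids concrete, here is a minimal sketch, assuming records are dictionaries, numeric features are range-scaled, and categorical features score 0 for a match and 1 for a mismatch. The function name, the feature names, and the value ranges are hypothetical.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-feature dissimilarity between two records (dicts).

    numeric_ranges maps each numeric feature name to its (min, max) in the data;
    every other feature is treated as categorical.
    """
    total = 0.0
    for name in a:
        if name in numeric_ranges:
            lo, hi = numeric_ranges[name]
            # numeric: absolute difference scaled by the feature's range
            total += abs(a[name] - b[name]) / (hi - lo) if hi > lo else 0.0
        else:
            # categorical: 0 if the values match, 1 if they differ
            total += 0.0 if a[name] == b[name] else 1.0
    return total / len(a)

# Hypothetical students and ranges, for illustration only
student_a = {"gpa": 3.0, "credits": 12, "employment": "part-time", "veteran": False}
student_b = {"gpa": 2.0, "credits": 6, "employment": "full-time", "veteran": False}
ranges = {"gpa": (0.0, 4.0), "credits": (0, 24)}
distance = gower_distance(student_a, student_b, ranges)  # (0.25 + 0.25 + 1 + 0) / 4
```

Because every feature contributes a value between 0 and 1, continuous and categorical variables can be mixed in one distance without any single feature dominating.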

Algorithm | Strengths (+) | Limitations (-)
BIRCH / 2-Step | Flexible variable types, fast compute, large data | There is an element of a black box
Hierarchical | Flexible modeling, highly explainable | Limited to small data
K-means, K-modes | Easy, fast, large data | Sensitive to K, curse of dimensionality, clusters tend to be same sized
K-medoids | Like K-means but more flexible variable types | More costly tuning & compute

• Population: adults 24 years and older enrolled for Fall 2021
• Features: academics, demographic, financial, household, and veteran status
• Mix of discrete and continuous variables, with the majority being categorical data
Poverty Flag / INCOME BIN categories, relative to median household income in Texas:
• Below Federal poverty level
• Below median
• Lower half above median
• Upper half above median

Dissimilarity measure (0: match, 1: mismatch)
         V1 V2 V3 V4 V5
Sample 1  0  1  1  0  1
Sample 2  1  0  0  0  1
Dissimilarity = 1+1+1+0+0 = 3
K-modes uses a frequency-based method to calculate the mode instead of the mean of the sample.
• Randomly select the K initial centers
• Repeat
1. Assign the samples to the nearest center
2. Update the means/modes based on the newly formed clusters
3. Calculate the cost (SSE/sum of dissimilarities)
• Stop when the cluster centers converge
Goal: minimize the cost function
Missing values
• Categorical variables, such as employment status
• Continuous variables, such as income
Possible solutions
• Revisit our data warehouse to obtain as much information as we can to fill in the missing values
• Imputation with K-Nearest Neighbors using Hamming distance for categorical data
• Imputation with K-Nearest Neighbors using Euclidean distance for continuous data
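As an illustration of the categorical case, the sketch below fills a missing value with the majority value among the K nearest complete records, measured by Hamming distance on the other columns. It assumes records are tuples, that missing values are stored as `None`, and that only the target column has gaps. The function names and the sample data are hypothetical.

```python
from collections import Counter

def hamming(a, b, skip):
    """Number of positions (other than `skip`) where two records differ."""
    return sum(1 for i, (x, y) in enumerate(zip(a, b)) if i != skip and x != y)

def knn_impute_categorical(records, col, k=3):
    """Fill None values in column `col` with the majority value among the
    k nearest complete records, using Hamming distance on the other columns."""
    donors = [r for r in records if r[col] is not None]
    filled = []
    for r in records:
        if r[col] is None:
            nearest = sorted(donors, key=lambda d: hamming(r, d, col))[:k]
            value = Counter(d[col] for d in nearest).most_common(1)[0][0]
            r = r[:col] + (value,) + r[col + 1:]  # rebuild the tuple with the imputed value
        filled.append(r)
    return filled

# Hypothetical records: (age band, dependents, employment status)
records = [
    ("24-34", "2 dependents", "part-time"),
    ("24-34", "2 dependents", "part-time"),
    ("35-44", "0 dependents", "full-time"),
    ("35-44", "0 dependents", "full-time"),
    ("24-34", "2 dependents", None),
]
filled = knn_impute_categorical(records, col=2, k=3)
```

The continuous case follows the same pattern, with Euclidean distance in place of Hamming distance and the mean of the neighbors' values in place of the majority vote.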

Jeremy Anderson, Associate Vice Chancellor of Strategic Analytics, jeremy.anderson@dcccd.edu
Dillon Lu, Data Analyst, DLu@dcccd.edu


Clustering Models to Assist in Student Outreach

1. February 8, 2022

21. Student Personas


Editor's Notes

Refer to the recent Dallas Morning News article and its framing. Give a brief overview focused on two components: data and demographics by core populations, and interventions to help support all students.
You have a population in n dimensions of space. Your model breaks that into subpopulations that are self-similar and distinct from other subpopulations. There are many algorithms that attack this basic task in different ways.
The idea with all of these kinds of research questions is that you take a very large population and break it down into smaller groups. Then, you customize the messaging to fit the cluster personas. This is used a lot in the marketing world to create customer personas, but it has applications to any kind of “product,” like academic programs and student services.
The case count was fairly high: 45k+ female students had enrolled in the prior three terms, had not attained a credential or transferred, and were not enrolled for the Fall. Some variables, like income and household size, are highly missing if we rely just on FAFSA because only about half of our students complete it. Instead, we supplement with a homegrown Student Information Profile (SIP) that goes out once a year and asks those and other questions. Part of data preparation is merging the two data streams, which is also a challenge because these variables are binned as categorical in the SIP but are continuous in FAFSA. The other variables are highly available and mostly clean, so very little cleanup was necessary. Other than that, we do some regular feature engineering around membership in special populations like athletes, international students, foster students, veterans, etc. That's code that we've written and reuse across projects to save some time. After the data prep, I stepped back and saw that the features for consideration were all mixed: some were binary and others categorical or continuous. For that reason, and for the size of the data, I chose the two-step clustering algorithm in SPSS because it works well under both conditions and because, as a social scientist by training, I am most familiar with SPSS as an analysis tool. With the data ready, I then played around with the features to maximize the evaluation score I was seeing. I loaded them all in at once, then looked at the cluster overlap charts, which are a feature in SPSS. I took out the ones that didn’t have strong separation and was left with a collection of features that would be most useful/impactful for final modeling.
The tool I chose to use was SPSS because that's my bread and butter as a social scientist by trade. It has a couple of built-in options for clustering. In step 1, the similarity score, based on the distance between the case and the node values, is the determining factor. If the distance is above the algorithm's threshold and the node would spread too wide to incorporate it, then the data point goes into a newly created node. In step 2, the nodes from the tree start to get clumped together, also based on their distances. Lots of different cluster counts are tried and evaluated automatically with one of the available clustering criteria that you can choose from in the model settings in SPSS. Ultimately, you get a set of the best clusters for the data. The model picks the number of clusters automatically through the Cluster Tree creation and the evaluation of the agglomerations.
I got my silhouette measure, which gave me a sense of how self-similar the points in each cluster were and how different each cluster was from the others. The score ranged from 0.6 to 0.7, which falls in the good range. I was able to click into the scoring for details. That’s the clusters matrix, which shows the clusters as columns and the features used to build the clusters as rows. You can click on one of the cells to see that feature’s overall distribution and what’s present in the cluster in the Cluster Distribution visual. You can click multiple clusters to see how they compare visually in the Cluster Comparison.
The clusters: Hispanic, Black, or White, with 3-4 dependents, employed part-time; Hispanic, 3 dependents, employed full-time; Black or White, 1-2 dependents, employed full-time. The overriding message was POC with more dependents, especially if employed less than full-time. The hypothesis there was that we should focus messaging on emergency aid, whereas full-time workers with fewer dependents might benefit from an awareness campaign about our childcare assistance. Really, though, we needed to test these assumptions. We used these clusters to create a call list of ~350 students with the aim of talking to 50 (10-15 per cluster). The phone interviews were short, about 10 minutes, and guided by a set of template questions that our Student Success Research team and our student success staff worked on together. Results from the qualitative discussions left us with a collection of themes for each cluster and some prospective talking points to use in the text campaign. We ended up with 5600 students responding to the text campaign and re-enrolling, equating to an 18% response.
Before we take a quick tour of some other clustering algorithms and their strengths and weaknesses, let's pause for one or two comments or questions about our first example. We will also have some time at the end for other questions.
Today, we're looking at three very common categories of clustering algorithms, though there are others that are applicable to other types of challenges. Already covered two-step, which is similar to BIRCH, in example 1. The point there is that it blends the approach of the other types of algorithms listed to get the best of both worlds. For comparison, then, we’ll look at these other two common approaches in the K-algorithms and hierarchical algorithm.
The premise of all three of these is that you pick a number K of clusters. Rather than pick at random, two good ways to narrow in on the best K for the data are the elbow method (for this data, you might want to try somewhere between 6 and 8) and the Silhouette Coefficient. The cluster-picking methods and the algorithms in general are available in Python's scikit-learn package, and they're in R, but it'll take a few packages.
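For readers who want to see what the Silhouette Coefficient measures, here is a minimal pure-Python sketch, assuming numeric points, Euclidean distance, and at least two clusters. The function name mirrors the one in scikit-learn, but this is a hand-written illustration, not that library's code, and the sample points are hypothetical.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical points: two tight, well-separated groups
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the groups
poor = silhouette_score(points, [0, 1, 0, 1])  # labels cut across the groups
```

To compare candidate values of K, one would cluster the data once per K and keep the K with the highest score. Values near 1 indicate tight, well-separated clusters, and values near 0 or below indicate overlap.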
So, how does it work, generally? K-means places points as the centers of the clusters, whereas K-medoids chooses the most representative case. K-medoids is less sensitive to outliers, like using the median rather than the mean, so it may be preferable to K-means depending on your data. K-modes, meanwhile, looks for the number of similarities between data points when considering the different features, but it still collapses all of this into a pre-determined number of clusters that you set.
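The outlier point can be seen in a tiny one-dimensional sketch. The function names and the dependent counts below are hypothetical and chosen only to show how a single extreme case pulls the mean but not the medoid.

```python
def mean_center(values):
    """K-means style center: the arithmetic mean, which need not be a real case."""
    return sum(values) / len(values)

def medoid(values):
    """K-medoids style center: the actual case with the smallest total
    distance to all other cases."""
    return min(values, key=lambda v: sum(abs(v - w) for w in values))

# Hypothetical dependent counts in one cluster, with a single outlier (12)
dependents = [1, 2, 2, 3, 12]
center_mean = mean_center(dependents)  # pulled upward by the outlier
center_medoid = medoid(dependents)     # stays at a typical case
```

Here the mean lands at 4.0, a value that describes none of the typical cases, while the medoid is 2, one of the actual observations.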
Match up pairs of cases based on the distance between them. Then match up pairs of pairs, also based on distance. Pick what distance you’re okay with and use that to determine the clusters. That's where it's different from Two-Step and BIRCH, which will do that automatically. Divisive flips this on its head and starts with all of the cases in one cluster. Available in Python scikit-learn and R.
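The pair-matching process with a distance cutoff can be sketched as below. This is a minimal illustration, assuming numeric points, Euclidean distance, and single linkage (the distance between two groups is the distance between their closest members); other linkage rules are equally valid. The function name and the sample points are hypothetical.

```python
import math

def agglomerative_cutoff(points, cutoff):
    """Single-linkage agglomerative clustering that stops merging at `cutoff`."""
    clusters = [[p] for p in points]  # start with every case in its own cluster

    def linkage(c1, c2):
        # single linkage: distance between the closest pair of members
        return min(math.dist(p, q) for p in c1 for q in c2)

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) > cutoff:
            break  # every remaining pair is farther apart than the cutoff
        clusters[i].extend(clusters.pop(j))  # merge the closest pair
    return clusters

# Hypothetical points, for illustration only
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
tight = agglomerative_cutoff(points, cutoff=2.0)
loose = agglomerative_cutoff(points, cutoff=10.0)
```

Raising the cutoff merges more groups, so the same data yields fewer, broader clusters. This exhaustive pair search is also why the approach suits small data better than large.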
Divide our student body into groups so we can look at each specific group more closely and identify what they need and what we can do to help them reach their educational goals.
K-modes clustering: determine the initial number of clusters K. Minimize the cost function while maintaining a relatively small value for K. The purpose of the study is to divide our student body into large groups so we can study each group more effectively and therefore have a better understanding of our student body as a whole. We prefer generalization to looking at specific cases. Choose 4 in this case and move on to the next page.
This information can help us develop strategic plans that focus on a particular group of students to help them achieve their goals and be successful here at our institution.
Many students with missing income end up in the second income bin due to the nature of our dataset design. Revisiting our data and eliminating as many missing values as possible to obtain more complete data should help greatly in distributing our students more evenly across the INCOME_BIN variable.
