I'm Zachary Thomas, a recent graduate of Galvanize's data science immersive. Attached is the slide deck for my capstone presentation! Email me with questions at zthomas.nc@gmail.com
7. Multi-Dimensional Scaling
First we obtain distances, then use MDS to visualize points and potential clusters.
26 dimensions → 2 dimensions
Students furthest apart: 59 and 75 (10.386)
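A minimal sketch of this step, assuming a numeric student feature matrix X with 26 columns (the random X below is a hypothetical stand-in, not the project data):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Hypothetical stand-in for the real 26-feature student matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 26))

# Step 1: pairwise Euclidean distances between students
D = squareform(pdist(X, metric="euclidean"))

# Step 2: MDS projects the 26-D distance structure into 2-D for plotting
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)

# The pair of students furthest apart
i, j = np.unravel_index(np.argmax(D), D.shape)
print(f"Students furthest apart: {i} and {j} ({D[i, j]:.3f})")
```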
8. K-Means Clustering
Cluster | Percent of Data | Characteristics
1 | 23% | Above avg. experience, below avg. scores
2 | 62% | Avg. experience, above average scores
3 | 15% | Below avg. experience, good at coding
An inertia plot + human intuition help us determine the right number of clusters.
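A sketch of the inertia ("elbow") plot, reusing the hypothetical X from the MDS sketch; the k range is also an assumption:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Inertia = within-cluster sum of squared distances; the "elbow" where it
# stops dropping sharply, plus a look at the clusters, suggests a good k
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()
```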
14. Dataset Insights
Cohort | Correlation b/t Initial and Final Student Rank
sf_15 | 0.435
sf_16 | 0.805
s_sept | 0.55
den | 0.17
s | 0.7
*Computed by ranking students according to their first (1) and last (5) assessment scores
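The deck doesn't name the correlation used; a rank correlation such as Spearman's rho fits the footnote. A sketch with made-up scores:

```python
from scipy.stats import spearmanr

# Hypothetical first (1) and last (5) assessment scores for one cohort
first_scores = [78, 62, 91, 55, 70, 84]
last_scores = [81, 60, 88, 65, 72, 79]

# Spearman's rho correlates the two sets of ranks
rho, p_value = spearmanr(first_scores, last_scores)
print(f"rank correlation: {rho:.3f}")
```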
15. Summary Visualizations
Relative performance, math and coding: Students in the orange box are relatively better at math than coding; students in the green box are the opposite.
Student Background: Engineering, mathematics, and economics are the three most common academic backgrounds of students.
19. DBSCAN
Need to tune DBSCAN better?
• KNN plot on data after scaling into two dimensions
• Clear knee at eps = 1.5
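A sketch of the KNN distance plot, reusing the 2-D coords from the MDS sketch; k = 4 is an assumed value, not the deck's:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# Sorted distance to each point's k-th nearest neighbor; the "knee" of
# this curve is a common heuristic for DBSCAN's eps (~1.5 per the slide)
k = 4
nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)  # +1: nearest "neighbor" is the point itself
distances, _ = nn.kneighbors(coords)
kth_distances = np.sort(distances[:, -1])

plt.plot(kth_distances)
plt.xlabel("points sorted by k-NN distance")
plt.ylabel(f"distance to {k}th nearest neighbor")
plt.show()
```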
20. OPTICS
Choose an epsilon (i.e. reachability, the vertical axis) of 1, producing two clusters (upper bound of 1.5).
Very reasonable clusters! Used the data scaled into two dimensions (above) for the model.
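A sketch with scikit-learn's OPTICS, again on the hypothetical 2-D coords; min_samples = 5 is an assumption:

```python
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Fit OPTICS with the 1.5 upper bound, then extract a DBSCAN-style
# clustering at the eps = 1 cut chosen from the reachability plot
optics = OPTICS(min_samples=5, max_eps=1.5).fit(coords)
labels = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=1.0,
)
print(set(labels))  # cluster ids; -1 marks noise
```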
23. Data Pipeline
• Motivation: Real-life data is messy – as I found out with this project.
• Data Sources:
• Handwritten pair assignments for student group assignments
• Student LinkedIn profiles
• Student Github profiles (list of usernames)
• Student assessment data
• Goal: Find the lengths of selected pair programming assignments to use as a proxy for work ethic.
• Selected 9 assignments that were unusually long or difficult – students would need more effort, or to work quickly, to get through them completely
• Process:
• Used a script (repo_scraper_optimized.py) to download a zip file of each student's Github repos, look for .py or .ipynb files, and record the number of characters in each file, ignoring lines that are comments (see the sketch after this list)
• Since hitting Github for data is I/O bound, used threading to run multiple requests
simultaneously.
• Parallelized collection (at one point, using AWS) to speed up even further
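A heavily simplified sketch of what repo_scraper_optimized.py does (the real script isn't shown; the repo name, branch, and omission of .ipynb handling here are assumptions):

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor

import requests

def count_code_chars(username, repo="pair-assignment"):  # hypothetical repo name
    """Download a repo's zip archive and count non-comment .py characters."""
    url = f"https://github.com/{username}/{repo}/archive/refs/heads/master.zip"
    archive = zipfile.ZipFile(io.BytesIO(requests.get(url).content))
    total = 0
    for name in archive.namelist():
        if name.endswith(".py"):  # the real script also handles .ipynb
            text = archive.read(name).decode("utf-8", errors="ignore")
            for line in text.splitlines():
                if line.strip() and not line.strip().startswith("#"):  # skip comments
                    total += len(line)
    return total

# Hitting Github is I/O bound, so threads let the requests overlap
usernames = ["student1", "student2"]  # hypothetical usernames
with ThreadPoolExecutor(max_workers=8) as pool:
    char_counts = dict(zip(usernames, pool.map(count_code_chars, usernames)))
```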
24. Data Pipeline (Part 2)
Process:
• In another script, used fuzzy matching on the student assignment pairs to determine file attribution. Looked at files with the word “pair” in the name to single out pair assignments (see the sketch after this list).
• Wrote results to a master spreadsheet with the assessment data. The data is now in a format where each row is a different student, with that student’s respective data in columns
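A sketch of the fuzzy-matching step with FuzzyWuzzy; the roster, name, and threshold below are hypothetical:

```python
from fuzzywuzzy import process

# Hypothetical roster mapping official student names to Github usernames
roster = {"Jane Doe": "jdoe", "John Q. Smith": "jqsmith"}

# Match a hand-typed name from the pair-assignment sheet to the roster
name_from_sheet = "Jon Smith"
best_match, score = process.extractOne(name_from_sheet, roster.keys())
if score >= 80:  # assumed similarity threshold (0-100 scale)
    username = roster[best_match]
    print(f"{name_from_sheet} -> {best_match} -> {username}")
```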
Challenges:
• Missing values! How to recognize and handle them
• Finding pair assignments and not individual assignments
• Pair assignments not in a format where each row is a unique student (each row is two different students)
Next Steps:
• Improve recall – 10-30% of files for an assignment are missing. Granted, those files may not exist.
• Incorporate data into clustering analysis
25. Glossary
• Threading – Used in Python to run multiple (I/O-bound) tasks at one time within a single process.
• Fuzzy Matching – This project used the Python package FuzzyWuzzy, which uses Levenshtein distance (when comparing strings A and B, the number of deletions, insertions, or substitutions needed to transform A into B). This was used to match hand-typed names with those on file, which were then matched to a Github username for further processing.
• Hierarchical Clustering – Tried out different methods of clustering (ended up using ‘complete’ linkage, which looks at the elements furthest away from each other when deciding whether to combine two clusters; `single` linkage, which looks at the closest elements between two clusters, gave me chained clusters), scaled/unscaled data (unscaled data gave more sensible results), and different kinds of distances (tried out cosine, but went with Euclidean). The package pvclust gave me a result with 8 clusters, some of which were sensible, though many were small.
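A sketch of the linkage comparison with SciPy, on the hypothetical X from earlier; cutting at 8 clusters mirrors the pvclust result:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# 'complete' linkage merges clusters based on their furthest-apart members;
# 'single' (closest members) tends to produce chained clusters instead
Z = linkage(pdist(X, metric="euclidean"), method="complete")
labels = fcluster(Z, t=8, criterion="maxclust")  # cut the tree into 8 clusters
```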
• Non-metric Multidimensional Scaling – The algorithm gives a projection that minimizes stress: the square root of the ratio of the sum of squared differences between input distances and configuration distances to the sum of squared input distances.
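Written out (Kruskal's stress-1, with $d_{ij}$ the input distances and $\hat{d}_{ij}$ the configuration distances):

$$\text{Stress} = \sqrt{\frac{\sum_{i<j}\left(d_{ij} - \hat{d}_{ij}\right)^2}{\sum_{i<j} d_{ij}^2}}$$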
26. Glossary – DBSCAN
• DBSCAN – Short for “Density-Based Spatial Clustering of Applications with Noise”. Basically works on the idea of looking for areas of density (points close together) that are denser than areas of noise. (Unlike K-means, which takes a centroid approach that assigns all points to a cluster.)
• DBSCAN Algorithm:
1. Compute distances between x_i and all other points. Find all neighbor points within distance eps of the point. Each point with a neighbor count greater than or equal to MinPts is marked as a core point.
2. For each core point, assign it to a new cluster if it hasn’t been already. Recursively find all of its density-reachable points and assign them to the same cluster as the core point.
3. Iterate through all points. Points that do not belong to a cluster are treated as noise.
• DBSCAN (tuning) parameters epsilon (eps) and minimum points (MinPts) – eps is determined using a KNN distance plot and looking for the knee point. MinPts is set by the user: a smaller value will produce more clusters, a larger value fewer.
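A sketch with scikit-learn's DBSCAN, using the eps = 1.5 knee from slide 19 on the 2-D coords; min_samples = 5 is an assumption:

```python
from sklearn.cluster import DBSCAN

# eps from the KNN-distance knee; min_samples is MinPts
db = DBSCAN(eps=1.5, min_samples=5).fit(coords)
labels = db.labels_  # -1 marks noise points outside any dense region
```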
27. Glossary – OPTICS
• OPTICS – Short for “Ordering Points To Identify the Clustering Structure.” A generalization of DBSCAN that addresses the algorithm’s weakness in detecting clusters of differing densities.
• Reachability Plot – Created by the OPTICS algorithm; valleys in the plot represent points in clusters, since their reachability scores (roughly, the smallest distance at which a point can be reached from an already-processed neighbor) are low. Points are ordered such that they sit next to their neighbors. Identify clusters like hierarchical clustering does, by ’cutting’ the reachability plot.
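A sketch of the "cut", reusing the fitted optics model from the slide 20 sketch:

```python
import matplotlib.pyplot as plt

# Reachability plot: points in OPTICS ordering on x, reachability on y;
# valleys are clusters, and a horizontal cut at eps extracts them
plt.plot(optics.reachability_[optics.ordering_])
plt.axhline(1.0, linestyle="--")  # the eps = 1 cut from slide 20
plt.xlabel("points in OPTICS ordering")
plt.ylabel("reachability distance")
plt.show()
```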
Editor's Notes
• Data Science Immersive Program student data:
• Assessment data
• Years of experience & background (LinkedIn)
• Daily exercise data
• Explain how this is:
• A good result based on algorithmic (mathematical) measures, e.g. shadow values, silhouette plot, etc.
• A good result based on human intuition – look at the bar plots