I'm Zachary Thomas, a recent graduate of Galvanize's data science immersive. Attached is the slide deck for my capstone presentation! Email me with questions at zthomas.nc@gmail.com
7. Multi-Dimensional Scaling
First we obtain distances, then use MDS to visualize points and potential clusters.
26 dimensions → 2 dimensions
Students furthest apart: 59 and 75 (10.386)
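A minimal sketch of this step, assuming a numeric student feature matrix X with 26 columns (the random X below is a hypothetical stand-in, not the project data):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Hypothetical stand-in for the real 26-feature student matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 26))

# Step 1: pairwise Euclidean distances between students
D = squareform(pdist(X, metric="euclidean"))

# Step 2: MDS projects the 26-D distance structure into 2-D for plotting
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)

# The pair of students furthest apart
i, j = np.unravel_index(np.argmax(D), D.shape)
print(f"Students furthest apart: {i} and {j} ({D[i, j]:.3f})")
```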
8. K-Means Clustering
Cluster | Percent of Data | Characteristics
1 | 23% | Above avg. experience, below avg. scores
2 | 62% | Avg. experience, above average scores
3 | 15% | Below avg. experience, good at coding
An inertia plot + human intuition help us determine the right number of clusters.
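A sketch of the inertia ("elbow") plot, reusing the hypothetical X from the MDS sketch; the k range is also an assumption:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Inertia = within-cluster sum of squared distances; the "elbow" where it
# stops dropping sharply, plus a look at the clusters, suggests a good k
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()
```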
14. Dataset Insights
Cohort | Correlation b/t Initial and Final Student Rank
sf_15 | 0.435
sf_16 | 0.805
s_sept | 0.55
den | 0.17
s | 0.7
*Computed by ranking students according to their first (1) and last (5) assessment scores
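The deck doesn't name the correlation used; a rank correlation such as Spearman's rho fits the footnote. A sketch with made-up scores:

```python
from scipy.stats import spearmanr

# Hypothetical first (1) and last (5) assessment scores for one cohort
first_scores = [78, 62, 91, 55, 70, 84]
last_scores = [81, 60, 88, 65, 72, 79]

# Spearman's rho correlates the two sets of ranks
rho, p_value = spearmanr(first_scores, last_scores)
print(f"rank correlation: {rho:.3f}")
```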
15. Summary Visualizations
Relative performance, math and coding: Students in the orange box are relatively better at math than coding; students in the green box are the opposite.
Student Background: Engineering, mathematics, and economics are the three most common academic backgrounds of students.
19. DBSCAN
Need to tune DBSCAN better?
• KNN plot on data after scaling into two dimensions
• Clear knee at eps = 1.5
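A sketch of the KNN distance plot, reusing the 2-D coords from the MDS sketch; k = 4 is an assumed value, not the deck's:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# Sorted distance to each point's k-th nearest neighbor; the "knee" of
# this curve is a common heuristic for DBSCAN's eps (~1.5 per the slide)
k = 4
nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)  # +1: nearest "neighbor" is the point itself
distances, _ = nn.kneighbors(coords)
kth_distances = np.sort(distances[:, -1])

plt.plot(kth_distances)
plt.xlabel("points sorted by k-NN distance")
plt.ylabel(f"distance to {k}th nearest neighbor")
plt.show()
```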
20. OPTICS
Choose an epsilon (i.e. reachability, the vertical axis) of 1, producing two clusters (upper bound of 1.5).
Very reasonable clusters! Used the data scaled into two dimensions (above) for the model.
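A sketch with scikit-learn's OPTICS, again on the hypothetical 2-D coords; min_samples = 5 is an assumption:

```python
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Fit OPTICS with the 1.5 upper bound, then extract a DBSCAN-style
# clustering at the eps = 1 cut chosen from the reachability plot
optics = OPTICS(min_samples=5, max_eps=1.5).fit(coords)
labels = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=1.0,
)
print(set(labels))  # cluster ids; -1 marks noise
```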
23. Data Pipeline
• Motivation: Real-life data is messy – as I found out with this project.
• Data Sources:
• Handwritten pair assignments for student group assignments
• Student LinkedIn profiles
• Student Github profiles (list of usernames)
• Student assessment data
• Goal: Find the lengths of selected pair programming assignments to use as a proxy for work ethic.
• Selected 9 assignments that were unusually long or difficult – students would need more effort, or to work quickly, to get through them completely
• Process:
• Used a script (repo_scraper_optimized.py) to download a zip file of each student's Github repos, look for .py or .ipynb files, and record the number of characters in each file, ignoring lines that are comments (see the sketch after this list)
• Since hitting Github for data is I/O bound, used threading to run multiple requests
simultaneously.
• Parallelized collection (at one point, using AWS) to speed up even further
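A heavily simplified sketch of what repo_scraper_optimized.py does (the real script isn't shown; the repo name, branch, and omission of .ipynb handling here are assumptions):

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor

import requests

def count_code_chars(username, repo="pair-assignment"):  # hypothetical repo name
    """Download a repo's zip archive and count non-comment .py characters."""
    url = f"https://github.com/{username}/{repo}/archive/refs/heads/master.zip"
    archive = zipfile.ZipFile(io.BytesIO(requests.get(url).content))
    total = 0
    for name in archive.namelist():
        if name.endswith(".py"):  # the real script also handles .ipynb
            text = archive.read(name).decode("utf-8", errors="ignore")
            for line in text.splitlines():
                if line.strip() and not line.strip().startswith("#"):  # skip comments
                    total += len(line)
    return total

# Hitting Github is I/O bound, so threads let the requests overlap
usernames = ["student1", "student2"]  # hypothetical usernames
with ThreadPoolExecutor(max_workers=8) as pool:
    char_counts = dict(zip(usernames, pool.map(count_code_chars, usernames)))
```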
24. Data Pipeline (Part 2)
Process:
• In another script, used fuzzy matching on the student assignment pairs to determine file attribution. Looked at files with the word “pair” in the name to single out pair assignments (see the sketch after this list).
• Wrote results to a master spreadsheet with the assessment data. The data is now in a format where each row is a different student, with that student’s respective data in columns
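A sketch of the fuzzy-matching step with FuzzyWuzzy; the roster, name, and threshold below are hypothetical:

```python
from fuzzywuzzy import process

# Hypothetical roster mapping official student names to Github usernames
roster = {"Jane Doe": "jdoe", "John Q. Smith": "jqsmith"}

# Match a hand-typed name from the pair-assignment sheet to the roster
name_from_sheet = "Jon Smith"
best_match, score = process.extractOne(name_from_sheet, roster.keys())
if score >= 80:  # assumed similarity threshold (0-100 scale)
    username = roster[best_match]
    print(f"{name_from_sheet} -> {best_match} -> {username}")
```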
Challenges:
• Missing values! How to recognize and handle them
• Finding pair assignments and not individual assignments
• Pair assignments not in a format where each row is a unique student (each row is two different students)
Next Steps:
• Improve recall – 10-30% of files for an assignment are missing. Granted, those files may not exist.
• Incorporate data into clustering analysis
25. Glossary
• Threading – Used in Python to run multiple (I/O-bound) tasks at one time within a single process.
• Fuzzy Matching – This project used the Python package FuzzyWuzzy, which uses Levenshtein distance (when comparing strings A and B, the number of deletions, insertions, or substitutions needed to transform A into B). This was used to match hand-typed names with those on file, which were then matched to a Github username for further processing.
• Hierarchical Clustering – Tried out different methods of clustering (ended up using ‘complete’ linkage, which looks at the elements furthest away from each other when deciding whether to combine two clusters; `single` linkage, which looks at the closest elements between two clusters, gave me chained clusters), scaled/unscaled data (unscaled data gave more sensible results), and different kinds of distances (tried out cosine, but went with Euclidean). The package pvclust gave me a result with 8 clusters, some of which were sensible, though many were small.
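A sketch of the linkage comparison with SciPy, on the hypothetical X from earlier; cutting at 8 clusters mirrors the pvclust result:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# 'complete' linkage merges clusters based on their furthest-apart members;
# 'single' (closest members) tends to produce chained clusters instead
Z = linkage(pdist(X, metric="euclidean"), method="complete")
labels = fcluster(Z, t=8, criterion="maxclust")  # cut the tree into 8 clusters
```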
• Non-metric Multidimensional Scaling – The algorithm gives a projection that minimizes stress: the square root of the ratio of the sum of squared differences between input distances and configuration distances to the sum of squared input distances.
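Written out (Kruskal's stress-1, with $d_{ij}$ the input distances and $\hat{d}_{ij}$ the configuration distances):

$$\text{Stress} = \sqrt{\frac{\sum_{i<j}\left(d_{ij} - \hat{d}_{ij}\right)^2}{\sum_{i<j} d_{ij}^2}}$$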
26. Glossary – DBSCAN
• DBSCAN – Short for “Density-Based Spatial Clustering of Applications with Noise”. Basically works on the idea of looking for areas of density (points close together) that are denser than areas of noise. (Unlike K-means, which takes a centroid approach that assigns all points to a cluster.)
• DBSCAN Algorithm:
1. Compute distances between x_i and all other points. Find all neighbor points within distance eps of the point. Each point with a neighbor count greater than or equal to MinPts is marked as a core point.
2. For each core point, assign it to a new cluster if it hasn’t been already. Recursively find all of its density-reachable points and assign them to the same cluster as the core point.
3. Iterate through all points. Points that do not belong to a cluster are treated as noise.
• DBSCAN (tuning) parameters epsilon (eps) and minimum points (MinPts) – eps is determined using a KNN distance plot and looking for the knee point. MinPts is set by the user: a smaller value will produce more clusters, a larger value fewer.
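A sketch with scikit-learn's DBSCAN, using the eps = 1.5 knee from slide 19 on the 2-D coords; min_samples = 5 is an assumption:

```python
from sklearn.cluster import DBSCAN

# eps from the KNN-distance knee; min_samples is MinPts
db = DBSCAN(eps=1.5, min_samples=5).fit(coords)
labels = db.labels_  # -1 marks noise points outside any dense region
```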
27. Glossary – OPTICS
• OPTICS – Short for “Ordering Points To Identify the Clustering Structure.” A generalization of DBSCAN that addresses the algorithm’s weakness in detecting clusters of differing densities.
• Reachability Plot – Created by the OPTICS algorithm; valleys in the plot represent points in clusters, since their reachability scores (roughly, the smallest distance at which a point can be reached from an already-processed neighbor) are low. Points are ordered such that they sit next to their neighbors. Identify clusters like hierarchical clustering does, by ’cutting’ the reachability plot.
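A sketch of the "cut", reusing the fitted optics model from the slide 20 sketch:

```python
import matplotlib.pyplot as plt

# Reachability plot: points in OPTICS ordering on x, reachability on y;
# valleys are clusters, and a horizontal cut at eps extracts them
plt.plot(optics.reachability_[optics.ordering_])
plt.axhline(1.0, linestyle="--")  # the eps = 1 cut from slide 20
plt.xlabel("points in OPTICS ordering")
plt.ylabel("reachability distance")
plt.show()
```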
Editor's Notes
• Data Science Immersive Program student data:
• Assessment data
• Years of experience & background (LinkedIn)
• Daily exercise data
• Explain how this is:
• A good result based on algorithmic (mathematical) measures, e.g. shadow values, silhouette plot, etc.
• A good result based on human intuition – look at the bar plots