Mining the Unseen:
Educational Data!
GALVANIZE CAPSTONE PROJECT
ZACHARY THOMAS
ZTHOMAS.NC@GMAIL.COM | LINKEDIN: THOMASZI | GITHUB: THOMZI12 | KAGGLE: ZTHOMAS
11/30/2016
Could a data science program use data
science insights?
How can we extract meaning from
unstructured data?
Data Sources & Tools
Data Science Program
Student Data
Assessment Data
Experience & Background
Daily Exercise Data
Unsupervised Learning
Data Collection
Student Pairs (hand-typed)
Student assignments (semi-unstructured)
Assessment data (structured)
+ creativity!
Tabular data, ready for analysis
What next?
Clustering Algorithms
K-means Clustering
Hierarchical Clustering
Multi-Dimensional Scaling
DBSCAN
OPTICS
Multi-Dimensional Scaling
First we obtain distances, then
use MDS to visualize points and
potential clusters.
26 dimensions → 2 dimensions
Students furthest apart:
59 and 75 (10.386)
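The "students furthest apart" figure can be reproduced from pairwise distances. The sketch below assumes Euclidean distance on a numeric feature table; the toy rows and the function name `farthest_pair` are hypothetical stand-ins, not the project's data or code.

```python
import math
from itertools import combinations

def farthest_pair(rows):
    """Return (i, j, distance) for the two rows with the largest Euclidean distance."""
    best = (None, None, -1.0)
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        d = math.dist(a, b)
        if d > best[2]:
            best = (i, j, d)
    return best

# Hypothetical toy data: 4 "students" with 3 features each (the project used 26).
students = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 2.0, 0.0),
    (3.0, 4.0, 0.0),
]
i, j, d = farthest_pair(students)
print(i, j, round(d, 3))
```

The same distance matrix is what an MDS routine would take as input before projecting to two dimensions.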
K-Means Clustering
Cluster   Percent of Data   Characteristics
1         23%               Above avg. experience, below avg. scores
2         62%               Avg. experience, above avg. scores
3         15%               Below avg. experience, good at coding
An inertia plot + human intuition help us
determine the right number of clusters.
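An inertia plot is built by running K-means for several values of k and recording the within-cluster sum of squares each time. The sketch below is a minimal pure-Python version of Lloyd's algorithm; the toy points and the naive "first k points" initialisation are assumptions made for brevity (libraries typically use smarter seeding).

```python
import math

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; returns (centroids, labels, inertia)."""
    # Naive initialisation (an assumption for brevity): the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, inertia

# Hypothetical toy data with three visible groups.
points = [(0, 0), (10, 10), (20, 0), (0, 1), (10, 11), (20, 1)]
for k in range(1, 5):
    print(k, round(kmeans(points, k)[2], 2))
```

On this toy data the inertia drops sharply up to k = 3 and barely changes afterwards, which is the "elbow" one looks for before applying human judgment.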
Action Steps
Specialized Program
Offerings
Scoping out entry-level positions
DS management training
Targeted
Marketing
Insights Dashboard
Next Steps for this Project
• Build cluster classifier for incoming students
• Obtain feedback from end users of data
Questions?
zthomas.nc@gmail.com | LinkedIn: thomaszi
Github: thomzi12 | Kaggle: zthomas
Thank you!
Appendix
Dataset Insights
Correlation between initial and final student rank, by cohort*
Cohort    Correlation
sf_15     0.435
sf_16     0.805
s_sept    0.55
den       0.17
s         0.7
*Done by analyzing rank according to first (1) and last (5) assessment scores
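One way to compute such a figure is a rank correlation: rank students by their first and last assessment scores, then correlate the two rank lists. The sketch below assumes a Spearman-style calculation (Pearson correlation of ranks); the slide does not say which coefficient was used, and the scores are made up.

```python
def ranks(scores):
    """Rank 1 = highest score; tied scores share the average of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            result[idx] = avg_rank
        pos = end + 1
    return result

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def rank_correlation(first_scores, last_scores):
    return pearson(ranks(first_scores), ranks(last_scores))

# Hypothetical scores for four students on assessments 1 and 5.
first = [90, 80, 70, 60]
last = [85, 88, 60, 65]
rho = rank_correlation(first, last)
print(round(rho, 3))
```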
Summary Visualizations
Relative performance, math and coding: students in the orange box are relatively better at math than coding; students in the green box are the opposite.
Student background: engineering, mathematics, and economics are the three most common academic backgrounds of students.
Work/Grad School Experience
Median: 6 years
1st quartile: 3 years
3rd quartile: 9 years
`Pvclust` Dendrogram
• Package used: pvclust
• Distance: Euclidean
• Linkage: complete
• Clusters: 8
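For readers unfamiliar with complete linkage, here is a minimal sketch of agglomerative clustering with Euclidean distance. It is a naive pure-Python illustration on made-up points, not the pvclust (R) procedure used for the dendrogram, and it omits pvclust's bootstrap p-values entirely.

```python
import math

def complete_linkage(points, n_clusters):
    """Merge clusters until n_clusters remain; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: distance between the two furthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical toy data: two tight pairs and one isolated point.
toy = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(complete_linkage(toy, 3))
```

Swapping `max` for `min` in `linkage` gives single linkage, which tends to produce the chained clusters mentioned in the glossary.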
Hierarchical Clustering Bar plots
DBSCAN
Need to tune DBSCAN better?
• KNN plot on data after scaling into two dimensions
• Clear knee at eps = 1.5
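The values behind a KNN distance plot can be computed directly: for each point, take the distance to its k-th nearest neighbour, then sort. The sketch below uses made-up 2-D points; the function name `k_distances` and the choice k = 2 are illustrative assumptions.

```python
import math

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical toy data: three close points and one outlier.
pts = [(0, 0), (0, 1), (1, 0), (5, 5)]
kd = k_distances(pts, k=2)
print([round(v, 3) for v in kd])
```

Plotting these sorted values and reading off where the curve bends sharply upward gives a candidate eps.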
OPTICS
Cutting the reachability plot at epsilon = 1 (reachability, on the vertical axis) produces two clusters.
The upper bound on epsilon was 1.5.
Very reasonable clusters! The model used the data scaled into two dimensions (above).
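Cutting a reachability plot at a chosen epsilon can be sketched as follows. This assumes the OPTICS ordering, reachability distances, and core distances have already been computed; the numbers are invented, and the rule follows the DBSCAN-style extraction described in the OPTICS literature.

```python
import math

def cut_reachability(reachability, core_dist, eps_cut):
    """Label points (in OPTICS order) by cutting the reachability plot at eps_cut.

    Returns one label per point; -1 marks noise.
    """
    labels = []
    cluster_id = -1
    for reach, core in zip(reachability, core_dist):
        if reach > eps_cut:
            if core <= eps_cut:
                # Not reachable from what came before, but dense enough
                # to start a new cluster.
                cluster_id += 1
                labels.append(cluster_id)
            else:
                labels.append(-1)
        else:
            # Still inside the current valley of the plot.
            labels.append(cluster_id)
    return labels

# Hypothetical values: two valleys separated by a peak, then an outlier.
reach = [math.inf, 0.5, 0.6, 2.0, 0.4, 0.5, 3.0]
core = [0.5, 0.5, 0.6, 0.4, 0.4, 0.5, 5.0]
print(cut_reachability(reach, core, eps_cut=1.0))
```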
K=3 K-means Inertia Plot
Cluster similarity matrix, using the ‘shadow’ method:
     [,1]  [,2]  [,3]
[1,] 1.000 0.130 0.468
[2,] 0.130 1.000 0.606
[3,] 0.468 0.606 1.000
K=3 K-means Clustering Bar plot
Data Pipeline
• Motivation: Real life data is messy – as I found out with this project.
• Data Sources:
• Handwritten pair assignments for student group assignments
• Student LinkedIn profiles
• Student Github profiles (list of usernames)
• Student assessment data
• Goal: Find the lengths of selected pair programming assignments to use as a proxy for work ethic.
• Selected 9 assignments that were unusually long or difficult – finishing them completely would take
more effort or a quick worker
• Process:
• Used a script (repo_scraper_optimized.py) to download a zip file of each student’s Github profile, look
for .py or .ipynb files, and record the number of characters in each file, ignoring lines that are comments.
• Since hitting Github for data is I/O bound, used threading to run multiple requests
simultaneously.
• Parallelized collection (at one point, using AWS) to speed it up even further
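The process above can be sketched with a thread pool. This is not repo_scraper_optimized.py: the in-memory `FAKE_FILES` dictionary and the `time.sleep` call are hypothetical stand-ins for downloading files over the network, so the example runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def count_code_chars(source):
    """Count characters in a Python source string, ignoring '#' comment lines."""
    return sum(
        len(line)
        for line in source.splitlines()
        if not line.strip().startswith("#")
    )

# Hypothetical stand-in for downloaded files, keyed by username.
FAKE_FILES = {
    "student_a": "# helper\nx = 1\nprint(x)\n",
    "student_b": "import math\n# comment only\n",
}

def fetch_and_count(username):
    time.sleep(0.05)  # simulates network latency; a real scraper would download here
    return username, count_code_chars(FAKE_FILES[username])

# Threads overlap the waiting time of I/O-bound work.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(fetch_and_count, FAKE_FILES))
print(results)
```

Because the workers spend most of their time waiting, threads help here even though Python runs only one thread of bytecode at a time.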
Data Pipeline (Part 2)
Process:
• In another script, used fuzzy matching against the student assignment pairs to determine file attribution.
Looked at files with the word “pair” in the name to single out pair assignments.
• Wrote results to a master spreadsheet alongside the assessment data. The data is now in a format where each
row is a different student, with that student’s data in the columns
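The name-matching idea can be illustrated with a plain Levenshtein distance. The project used the FuzzyWuzzy package, whose scoring is ratio-based; the hand-rolled functions and the names below are hypothetical stand-ins that only show the underlying edit-distance idea.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

def best_match(name, roster):
    """Return the roster entry closest to a hand-typed name (case-insensitive)."""
    return min(roster, key=lambda r: levenshtein(name.lower(), r.lower()))

# Hypothetical roster and hand-typed name.
roster = ["Joan Smythe", "John Smith"]
print(best_match("Jon Smith", roster))
```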
Challenges:
• Missing values! How to recognize and handle them
• Finding pair assignments and not individual assignments
• Pair assignments were not in a format where each row is a unique student (each row held two different students)
Next Steps:
• Improve recall – 10-30% of files for an assignment are missing. Granted, those files may not exist.
• Incorporate data into clustering analysis
Glossary
• Threading – Used in Python to run multiple I/O-bound tasks concurrently.
• Fuzzy Matching – This project used the Python package FuzzyWuzzy. It uses Levenshtein distance (when
comparing strings A and B, the number of deletions, insertions, or substitutions needed to transform A
into B). This was used to match hand-typed names with those on file, which were then matched to a
Github username for further processing.
• Hierarchical Clustering – Tried out different linkage methods, scaled and unscaled data, and different
distances. Ended up using ‘complete’ linkage, which looks at the elements furthest away from each other
when deciding whether to combine two clusters; ‘single’ linkage, which looks at the closest elements
between two clusters, gave me chained clusters. Unscaled data gave more sensible results. Tried out
cosine distance, but went with Euclidean. Package pvclust gave me a result with 8 clusters, some of
which were sensible, though many were small.
• Non-metric Multidimensional Scaling – The algorithm gives the projection that minimizes stress: the square
root of the ratio of the sum of squared differences between the input distances and the configuration
distances to the sum of squared configuration distances.
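The stress definition in the last entry can be written out in a few lines. This sketch assumes Kruskal's stress-1 and, as a simplification, compares against the raw input distances; non-metric MDS would first replace them with monotonically transformed values. The distance matrix and configurations are made up.

```python
import math

def stress1(input_dist, config):
    """Kruskal-style stress-1 between an input distance matrix and a low-dimensional configuration."""
    num = 0.0
    den = 0.0
    n = len(config)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(config[i], config[j])  # configuration distance
            num += (input_dist[i][j] - d) ** 2
            den += d ** 2
    return math.sqrt(num / den)

# Hypothetical input distances for three points (a 3-4-5 triangle).
D = [
    [0, 3, 4],
    [3, 0, 5],
    [4, 5, 0],
]
exact = [(0, 0), (3, 0), (0, 4)]      # reproduces D exactly
distorted = [(0, 0), (3, 0), (0, 3)]  # shortens one side
print(stress1(D, exact), round(stress1(D, distorted), 3))
```

A stress of zero means the projection preserves every distance; larger values mean more distortion.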
Glossary – DBSCAN
• DBSCAN – Short for “Density-Based Spatial Clustering of Applications with Noise”. Works by looking
for dense areas (points close together) that are separated by sparser areas of noise.
(Unlike K-means, which takes a centroid approach that assigns all points to a cluster.)
• DBSCAN Algorithm:
1. Compute distances between x_i and all other points. Find all neighbor points within distance eps of the point. Each point with a
neighbor count greater than or equal to MinPts is marked as a core point.
2. For each core point, assign it to a new cluster if it hasn’t been assigned already. Recursively find all its density-connected points
and assign them to the same cluster as the core point.
3. Iterate through all points. Points that do not belong to a cluster are treated as noise.
• DBSCAN (tuning) parameters epsilon (eps) and minimum points (MinPts) – eps is determined using a
KNN distance plot and looking for the knee point. MinPts is chosen by the user. Smaller values
will produce more clusters, larger values fewer.
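The three steps above can be sketched in pure Python. This is a minimal illustration on made-up points, not the implementation used in the project; it counts a point as its own neighbour, and it uses a queue in place of recursion.

```python
import math
from collections import deque

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels

# Hypothetical toy data: two dense squares and one far-away point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
print(dbscan(data, eps=1.5, min_pts=3))
```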
Glossary – OPTICS
• OPTICS – Short for “Ordering Points To Identify the Clustering Structure.” A generalization of DBSCAN
that addresses the algorithm’s weakness in detecting clusters of differing densities.
• Reachability Plot – Created by the OPTICS algorithm. Valleys in the plot represent points in clusters, since their
reachability distance (roughly, the distance to the nearest core point already visited, but never less than that
core point’s own core distance) is low. Points are ordered so that they sit next to their neighbors. Clusters are
identified by ’cutting’ the reachability plot, much as hierarchical clustering cuts a dendrogram.

More Related Content

What's hot

Neural network for machine learning
Neural network for machine learningNeural network for machine learning
Neural network for machine learningUjjawal
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means ClusteringAnna Fensel
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)Pravinkumar Landge
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clusteringArshad Farhad
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Balázs Hidasi
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsVarad Meru
 
K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnSarah Guido
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmPınar Yahşi
 
Presentation on unsupervised learning
Presentation on unsupervised learning Presentation on unsupervised learning
Presentation on unsupervised learning ANKUSH PAL
 
Grid based method & model based clustering method
Grid based method & model based clustering methodGrid based method & model based clustering method
Grid based method & model based clustering methodrajshreemuthiah
 
K means and dbscan
K means and dbscanK means and dbscan
K means and dbscanYan Xu
 
Neural netorksmatching
Neural netorksmatchingNeural netorksmatching
Neural netorksmatchingMasa Kato
 
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...butest
 
K-means clustering algorithm
K-means clustering algorithmK-means clustering algorithm
K-means clustering algorithmVinit Dantkale
 
K-Means clustring @jax
K-Means clustring @jaxK-Means clustring @jax
K-Means clustring @jaxAjay Iet
 

What's hot (20)

Neural network for machine learning
Neural network for machine learningNeural network for machine learning
Neural network for machine learning
 
Unsupervised Learning
Unsupervised LearningUnsupervised Learning
Unsupervised Learning
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means Clustering
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)
 
Clique and sting
Clique and stingClique and sting
Clique and sting
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its Applications
 
K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering Algorithm
 
Presentation on unsupervised learning
Presentation on unsupervised learning Presentation on unsupervised learning
Presentation on unsupervised learning
 
Kmeans
KmeansKmeans
Kmeans
 
K Nearest Neighbor Algorithm
K Nearest Neighbor AlgorithmK Nearest Neighbor Algorithm
K Nearest Neighbor Algorithm
 
Grid based method & model based clustering method
Grid based method & model based clustering methodGrid based method & model based clustering method
Grid based method & model based clustering method
 
K means and dbscan
K means and dbscanK means and dbscan
K means and dbscan
 
Neural netorksmatching
Neural netorksmatchingNeural netorksmatching
Neural netorksmatching
 
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
 
K-means clustering algorithm
K-means clustering algorithmK-means clustering algorithm
K-means clustering algorithm
 
K-Means clustring @jax
K-Means clustring @jaxK-Means clustring @jax
K-Means clustring @jax
 

Viewers also liked

Exploring Spatial data in GIS Environment
Exploring Spatial data in GIS Environment Exploring Spatial data in GIS Environment
Exploring Spatial data in GIS Environment NAXA-Developers
 
Historia del arte 2
Historia del arte 2Historia del arte 2
Historia del arte 2Ana de Avila
 
Matematicas Power Point
Matematicas Power PointMatematicas Power Point
Matematicas Power Pointana f
 
שתי סדנאות וסרט
שתי סדנאות וסרטשתי סדנאות וסרט
שתי סדנאות וסרטאלון
 
guideflyer2015 - Sans QR Code
guideflyer2015 - Sans QR Codeguideflyer2015 - Sans QR Code
guideflyer2015 - Sans QR CodePhilip Arbaugh
 
Tour Guide Training Slides
Tour Guide Training SlidesTour Guide Training Slides
Tour Guide Training Slidesdon47203
 
E-Commerce Paris 2016 Brochure
E-Commerce Paris 2016 BrochureE-Commerce Paris 2016 Brochure
E-Commerce Paris 2016 BrochureParis Retail Week
 
5.(1주제 절삭가공) 기계가공현장의 개선 포인트(제출)
5.(1주제 절삭가공) 기계가공현장의 개선 포인트(제출)5.(1주제 절삭가공) 기계가공현장의 개선 포인트(제출)
5.(1주제 절삭가공) 기계가공현장의 개선 포인트(제출)topshock
 
The seven golden principals of tour guiding
The seven golden principals of tour guidingThe seven golden principals of tour guiding
The seven golden principals of tour guidingOanh Nam
 
Sublinear tolerant property_testing_halfplane
Sublinear tolerant property_testing_halfplaneSublinear tolerant property_testing_halfplane
Sublinear tolerant property_testing_halfplaneIgor Kleiner
 

Viewers also liked (13)

Exploring Spatial data in GIS Environment
Exploring Spatial data in GIS Environment Exploring Spatial data in GIS Environment
Exploring Spatial data in GIS Environment
 
Historia del arte 2
Historia del arte 2Historia del arte 2
Historia del arte 2
 
Matematicas Power Point
Matematicas Power PointMatematicas Power Point
Matematicas Power Point
 
שתי סדנאות וסרט
שתי סדנאות וסרטשתי סדנאות וסרט
שתי סדנאות וסרט
 
Aerografia
Aerografia Aerografia
Aerografia
 
guideflyer2015 - Sans QR Code
guideflyer2015 - Sans QR Codeguideflyer2015 - Sans QR Code
guideflyer2015 - Sans QR Code
 
Tour Guide Training Slides
Tour Guide Training SlidesTour Guide Training Slides
Tour Guide Training Slides
 
agriculture careers
agriculture careersagriculture careers
agriculture careers
 
E-Commerce Paris 2016 Brochure
E-Commerce Paris 2016 BrochureE-Commerce Paris 2016 Brochure
E-Commerce Paris 2016 Brochure
 
5.(1주제 절삭가공) 기계가공현장의 개선 포인트(제출)
5.(1주제 절삭가공) 기계가공현장의 개선 포인트(제출)5.(1주제 절삭가공) 기계가공현장의 개선 포인트(제출)
5.(1주제 절삭가공) 기계가공현장의 개선 포인트(제출)
 
The seven golden principals of tour guiding
The seven golden principals of tour guidingThe seven golden principals of tour guiding
The seven golden principals of tour guiding
 
iBeacon
iBeaconiBeacon
iBeacon
 
Sublinear tolerant property_testing_halfplane
Sublinear tolerant property_testing_halfplaneSublinear tolerant property_testing_halfplane
Sublinear tolerant property_testing_halfplane
 

Similar to Could a Data Science Program use Data Science Insights?

Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...Maninda Edirisooriya
 
Data Mining: Cluster Analysis
Data Mining: Cluster AnalysisData Mining: Cluster Analysis
Data Mining: Cluster AnalysisSuman Mia
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptionsrefedey275
 
Unsupervised learning Modi.pptx
Unsupervised learning Modi.pptxUnsupervised learning Modi.pptx
Unsupervised learning Modi.pptxssusere1fd42
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxShwetapadmaBabu1
 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningPyingkodi Maran
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningAmAn Singh
 
Ability Study of Proximity Measure for Big Data Mining Context on Clustering
Ability Study of Proximity Measure for Big Data Mining Context on ClusteringAbility Study of Proximity Measure for Big Data Mining Context on Clustering
Ability Study of Proximity Measure for Big Data Mining Context on ClusteringKamleshKumar394
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3Nandhini S
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmEditor IJCATR
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsRebecca Bilbro
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
Document clustering for forensic analysis an approach for improving compute...
Document clustering for forensic   analysis an approach for improving compute...Document clustering for forensic   analysis an approach for improving compute...
Document clustering for forensic analysis an approach for improving compute...Madan Golla
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learningKnoldus Inc.
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersAlbert Y. C. Chen
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfSowmyaJyothi3
 

Similar to Could a Data Science Program use Data Science Insights? (20)

DBSCAN (1) (4).pptx
DBSCAN (1) (4).pptxDBSCAN (1) (4).pptx
DBSCAN (1) (4).pptx
 
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
 
Data Mining: Cluster Analysis
Data Mining: Cluster AnalysisData Mining: Cluster Analysis
Data Mining: Cluster Analysis
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
 
Unsupervised learning Modi.pptx
Unsupervised learning Modi.pptxUnsupervised learning Modi.pptx
Unsupervised learning Modi.pptx
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine Learning
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
PPT s10-machine vision-s2
PPT s10-machine vision-s2PPT s10-machine vision-s2
PPT s10-machine vision-s2
 
Ability Study of Proximity Measure for Big Data Mining Context on Clustering
Ability Study of Proximity Measure for Big Data Mining Context on ClusteringAbility Study of Proximity Measure for Big Data Mining Context on Clustering
Ability Study of Proximity Measure for Big Data Mining Context on Clustering
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids Algorithm
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and Distributions
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Document clustering for forensic analysis an approach for improving compute...
Document clustering for forensic   analysis an approach for improving compute...Document clustering for forensic   analysis an approach for improving compute...
Document clustering for forensic analysis an approach for improving compute...
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 

Recently uploaded

Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsalex933524
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单ewymefz
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单nscud
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单enxupq
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .NABLAS株式会社
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Domenico Conte
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJames Polillo
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单enxupq
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesStarCompliance.io
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单ewymefz
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?DOT TECH
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单nscud
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单ewymefz
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单ewymefz
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单ukgaet
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBAlireza Kamrani
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhArpitMalhotra16
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sMAQIB18
 

Recently uploaded (20)

Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDB
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
 

Could a Data Science Program use Data Science Insights?

  • 1. Mining the Unseen: Educational Data! GALVANIZE CAPSTONE PROJECT ZACHARY THOMAS ZTHOMAS.NC@GMAIL.COM | LINKEDIN: THOMASZI | GITHUB: THOMZI12 | KAGGLE: ZTHOMAS 11/30/2016
  • 2. Could a data science program use data science insights?
  • 3. Could a data science program use data science insights? How can we extract meaning from unstructured data?
  • 4. Data Sources & Tools Data Science Program Student Data Assessment Data Experience & Background Daily Exercise Data Unsupervised Learning
  • 5. Data Collection Student Pairs (hand-typed) Student assignments (semi-unstructured) Assessment data (structured) + creativity! Tabular data, ready for analysis What next?
  • 7. Multi-Dimensional Scaling First we obtain distances, then use MDS to visualize points and potential clusters. 26 dimensions → 2 dimensions Students furthest apart: 59 and 75 (10.386)
  • 8. K-Means Clustering Cluster Percent of Data Characteristics 1 23% Above avg. experience, below avg. scores 2 62% Avg. experience, above average scores 3 15% Below avg. experience, good at coding An inertia plot + human intuition help us determine the right number of clusters.
  • 9. Action Steps Specialized Program Offerings Scoping out entry-level positionsDS management training Targeted Marketing
  • 11. Next Steps for this Project • Build cluster classifier for incoming students • Obtain feedback from end users of data
  • 12. Questions? zthomas.nc@gmail.com | LinkedIn: thomaszi Github: thomzi12 | Kaggle: zthomas Thank you!
  • 14. Dataset Insights Cohort Correlation b/t Initial and Final Student Rank sf_15 0.435 Sf_16 0.805 s_sept 0.55 den 0.17 s 0.7 *Done by analyzing rank according to first (1) and last (5) assessment scores
  • 15. Summary Visualizations Relative performance, math and coding Student Background Students in orange box are relatively better at math than coding; students in the green are the opposite Engineering, mathematics, and economics are the three most common academic backgrounds of students
  • 16. Work/Grad School Experience Median: 6 years 1st quartile: 3 years 3rd quartile: 9 years
  • 17. `Pvclust` Dendrogram • Package used: pvclust • Distance: Euclidean • Linkage: complete • Clusters: 8
  • 19. DBSCAN Need to tune DBSCAN better?• KNN plot on data after scaling into two dimensions • Clear knee at eps = 1.5
  • 20. OPTICS Choose epsilon (i.e. reachability, vertical axis) of 1, producing two clusters. (Upper bound of 1.5) Very reasonable clusters! Used data scaled into two dimensions (above) for model
  • 21. K=3 K-means Inertia Plot [,1] [,2] [,3] [1,] 1.000 0.130 0.468 [2,] 0.130 1.000 0.606 [3,] 0.468 0.606 1.000 Cluster Similarity Matrix, using ’shadow’ method
  • 23. Data Pipeline • Motivation: Real-life data is messy, as I found out with this project. • Data Sources: handwritten pair assignments for student group assignments; student LinkedIn profiles; student GitHub profiles (list of usernames); student assessment data. • Goal: Find the lengths of selected pair programming assignments to use as a proxy for work ethic. Selected 9 assignments that were unusually long or difficult, so finishing them completely would require more effort or a quick worker. • Process: Used a script (repo_scraper_optimized.py) to download a zip file of each student's GitHub profile, look for .py or .ipynb files, and record the number of characters in each file, ignoring lines that are comments. Since hitting GitHub for data is I/O bound, used threading to run multiple requests simultaneously. Parallelized collection (at one point, using AWS) to speed it up even further.
  • 24. Data Pipeline (Part 2) • Process: In another script, used fuzzy matching on the student assignment pairs to determine file attribution. Looked at files with the word “pair” in the name to single out pair assignments. Wrote results to a master spreadsheet with the assessment data. The data is now in a format where each row is a different student, with that student's data in columns. • Challenges: Missing values, and how to recognize and handle them; finding pair assignments rather than individual assignments; pair assignments not in a format where each row is a unique student (each row is two different students). • Next Steps: Improve recall, since 10-30% of files for an assignment are missing (granted, those files may not exist); incorporate the data into the clustering analysis.
  • 25. Glossary • Threading – Used in Python to run multiple (I/O-bound) tasks concurrently. • Fuzzy Matching – This project used the Python package FuzzyWuzzy. It uses Levenshtein distance (when comparing strings A and B, the minimum number of deletions, insertions, or substitutions needed to transform A into B). This was used to match hand-typed names with those on file, which were then matched to a GitHub username for further processing. • Hierarchical Clustering – Tried out different linkage methods, scaled vs. unscaled data, and different kinds of distances. Ended up using ‘complete’ linkage, which looks at the elements furthest away from each other when deciding whether to combine two clusters; ‘single’ linkage, which looks at the closest elements between two clusters, gave me chained clusters. Unscaled data gave more sensible results. Tried out cosine distance but went with Euclidean. Package pvclust gave me a result with 8 clusters, some of which were sensible, though many were small. • Non-metric Multidimensional Scaling – The algorithm gives a projection that minimizes stress: the square root of the ratio of the sum of squared differences between the (monotonically transformed) input distances and the configuration distances to the sum of squared configuration distances.
  • 26. Glossary – DBSCAN • DBSCAN – Short for “Density-Based Spatial Clustering of Applications with Noise”. Basically works on the idea of looking for areas of density (points close together) that are denser than areas of noise. (Unlike K-means, which takes a centroid approach that assigns all points to a cluster.) • DBSCAN Algorithm: 1. Compute distances between x_i and all other points. Find all neighbor points within distance eps of the point. Each point with a neighbor count greater than or equal to MinPts is marked as a core point or visited. 2. For each core point, assign it to a new cluster if it hasn’t been assigned already. Recursively find all its density-connected points and assign them to the same cluster as the core point. 3. Iterate through all points. Points that do not belong to a cluster are treated as noise. • DBSCAN (tuning) parameters epsilon (eps) and minimum points (MinPts) – eps is determined using a KNN distance plot and looking for the knee point. MinPts is determined by the user. Smaller values will produce more clusters, larger values fewer.
  • 27. Glossary – OPTICS • OPTICS – Short for “Ordering points to identify the clustering structure.” A generalization of DBSCAN that addresses the algorithm’s weakness in detecting clusters of differing densities. • Reachability Plot – Created by the OPTICS algorithm. Valleys in the plot represent points in clusters, since their reachability score (distance to nearest neighbor?) is low. Points are ordered such that they are next to their neighbors. Clusters are identified, as in hierarchical clustering, by ‘cutting’ the reachability plot.
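
The three DBSCAN steps listed in the glossary can be sketched in plain Python as follows. This is only an illustrative sketch, not the code used in the project: the function names (`region_query`, `dbscan`), the toy points, and the chosen `eps` and `min_pts` values are all assumptions made for the example, and Euclidean distance is assumed throughout.

```python
import math

NOISE = -1  # label given to points that end up in no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        # Step 1: a point is a core point if it has at least min_pts
        # neighbors (the count here includes the point itself).
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        # Step 2: start a new cluster and grow it through density-connected points.
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, but is not expanded
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point, keep growing
    # Step 3: anything still labeled NOISE belongs to no cluster.
    return labels


# Toy data (hypothetical): two tight squares of points plus one far-away outlier.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11),
              (50, 50)]
print(dbscan(toy_points, eps=1.5, min_pts=3))
# prints [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Unlike K-means, the outlier at (50, 50) is left unassigned rather than forced into the nearest cluster, which is the behavior the glossary entry contrasts with a centroid approach.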

Editor's Notes

  1. Data Science Immersive Program. Student data: assessment data; years of experience & background (LinkedIn); daily exercise data.
  2. Explain how this is: a good result based on algorithm (mathematical) measures, e.g. shadow values, silhouette plot, etc.; a good result based on human intuition (look at bar plots).