The recommendations system for source code components retrieval
2016 Cymer Intern
2. Educational Background
Slide 2
Education:
• Bachelor of Science in Computer Science
• Georgia Institute of Technology
• Expected graduation date May 2019
• Big Data Club: entity tagging news sources
3. Project Goals
Slide 3
Objective:
To build a text-mining model that indicates when the usage rate of top
keywords changes or when a new keyword emerges.
Background:
• A Service Request (SR) is generated whenever an FSE works on a laser
• Some SRs do not involve replacing a part
• An SR’s main free-text fields are: Customer Description, Problem
Found, Task Description, and Resolution
5. Data Pipeline
Slide 5
User filters which SRs to process
Extract SR’s text
• Customer Description
• Problem Found
• Task Description
Tokenize, group, and stem text
• MO PRA -> MO_PRA
• OPTIMIZING, OPTIMIZED, OPTIMIZES -> OPTIMIZ
Replace similar words
• NEON, NE -> NE
• SOFTWARE, SW, SOFT WARE -> SW
Remove weak words
• Dates
• Numbers
• Stopwords: AND, IS, BUT
• Selected words: DUE, END, GROUP
Save SR number with terms
Calculate frequency
Display in Spotfire
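The text-normalization steps of this pipeline can be sketched in Python as follows. This is a minimal illustration, not the script used in the project: the word lists contain only the examples shown on the slide, the suffix-stripping `stem` function is a toy stand-in for whatever stemmer was actually used, and the date and number patterns are assumptions.

```python
import re

# Illustrative word lists built from the slide's examples only.
PHRASES = {("MO", "PRA"): "MO_PRA", ("SOFT", "WARE"): "SW"}
SYNONYMS = {"NEON": "NE", "SOFTWARE": "SW"}
STOPWORDS = {"AND", "IS", "BUT"}
SELECTED_WORDS = {"DUE", "END", "GROUP"}
SUFFIXES = ("ING", "ED", "ES")  # toy stemmer, not a real stemming algorithm

# Assumed formats: dates like 5/12/2016 or 2016-05-12, plain numbers like 42 or 3.5.
DATE_RE = re.compile(r"\d{1,4}[/-]\d{1,2}(?:[/-]\d{1,4})?")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def tokenize(text):
    """Upper-case, split on whitespace, and trim surrounding punctuation."""
    tokens = (tok.strip(".,;:!?()\"'") for tok in text.upper().split())
    return [tok for tok in tokens if tok]


def group_phrases(tokens):
    """Merge known two-word phrases into a single token (MO PRA -> MO_PRA)."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            out.append(PHRASES[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def stem(token):
    """Strip one common suffix, leaving grouped phrases and short words alone."""
    if "_" in token:
        return token
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def is_weak(token):
    """True for dates, numbers, stopwords, and hand-selected words."""
    return (
        token in STOPWORDS
        or token in SELECTED_WORDS
        or DATE_RE.fullmatch(token) is not None
        or NUMBER_RE.fullmatch(token) is not None
    )


def normalize(text):
    """Tokenize, group, stem, replace similar words, then drop weak words."""
    tokens = [stem(tok) for tok in group_phrases(tokenize(text))]
    tokens = [SYNONYMS.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if not is_weak(tok)]
```

For a made-up input such as "Optimized MO PRA and soft ware is due 5/12/2016; neon 42 low", `normalize` keeps only OPTIMIZ, MO_PRA, SW, NE, and LOW.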
8. Remove
Slide 8
Not all text adds meaning to the analysis
• Dates
• Numbers
• Stopwords
• Regex
Hand-selected words that should be removed: GROUP, END
Words only to be used in pairs: INCREASE, MO
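One possible reading of the pair-only rule is sketched below in Python. The slide does not say how pairs are formed, so the choice to join a pair-only word with the token that follows it, and to drop it when nothing follows, is an assumption; `apply_pair_rule` is a hypothetical helper name.

```python
# Words from the slide that should never stand alone as a keyword.
PAIR_ONLY = {"INCREASE", "MO"}


def apply_pair_rule(tokens):
    """Join each pair-only word with the next token; drop it if it is last.

    Assumption: the pair is always formed with the following token.
    """
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in PAIR_ONLY:
            if i + 1 < len(tokens):
                out.append(tok + "_" + tokens[i + 1])
                i += 2
            else:
                i += 1  # a trailing pair-only word carries no meaning alone
        else:
            out.append(tok)
            i += 1
    return out
```

Under this reading, the tokens MO, EFFICIENCY, ISSUES become MO_EFFICIENCY, ISSUES, while a lone trailing MO is discarded.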
9. Methodology
Slide 9
Recurring Keywords:
• Python script embedded in Spotfire
• Each word is stored once for overall usage and once for each month in which it appears
• Each word maps to the set of unique SRs in which it is used
• Total and monthly SR counts are kept
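The bookkeeping described in these bullets could look roughly like the Python sketch below. The record layout `(sr_number, month, terms)` and the function names are assumptions made for illustration; the embedded Spotfire script itself is not shown in the slides.

```python
from collections import defaultdict


def build_index(records):
    """Index terms by the SRs that use them, overall and per month.

    records: iterable of (sr_number, month, terms) tuples (assumed layout).
    """
    overall = defaultdict(set)        # word -> SRs using it, all time
    monthly = defaultdict(set)        # (word, month) -> SRs using it that month
    srs_by_month = defaultdict(set)   # month -> all SRs filed that month
    all_srs = set()
    for sr, month, terms in records:
        all_srs.add(sr)
        srs_by_month[month].add(sr)
        for word in set(terms):       # a word counts once per SR
            overall[word].add(sr)
            monthly[(word, month)].add(sr)
    return overall, monthly, srs_by_month, all_srs


def monthly_rate(word, month, monthly, srs_by_month):
    """Fraction of that month's SRs that mention the word."""
    total = len(srs_by_month.get(month, ()))
    if total == 0:
        return 0.0
    return len(monthly.get((word, month), ())) / total


# Made-up records, for illustration only.
records = [
    ("SR1", "2016-05", ["NE", "SW"]),
    ("SR2", "2016-05", ["NE"]),
    ("SR3", "2016-06", ["SW", "SW"]),
]
overall, monthly, srs_by_month, all_srs = build_index(records)
```

Storing sets of SR numbers rather than raw counts means a word repeated inside one SR is counted once, and the rate for any month is just the size of the word's set divided by the number of SRs in that month.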
Emerging Trends:
• R script embedded in Spotfire
• Hypergeometric test compares the most recent two months
• Same statistical test used for EWI
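A hypergeometric comparison of two months can be sketched as follows. The project ran this step in R; the Python version here is only an illustration, and the way the test is set up (pool both months, then ask how likely it is that the most recent month holds at least the observed number of SRs containing the word) is an assumption about the approach, not a description of the actual script.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / total


def emerging_p_value(prev_hits, prev_total, cur_hits, cur_total):
    """One-sided p-value that a word is over-represented in the current month.

    prev_hits / cur_hits: SRs containing the word in each month.
    prev_total / cur_total: all SRs in each month.
    """
    return hypergeom_upper_tail(
        k=cur_hits,
        population=prev_total + cur_total,
        successes=prev_hits + cur_hits,
        draws=cur_total,
    )
```

A small p-value suggests the word appears in the most recent month more often than chance would explain, which is one way to flag it as emerging. For example, `emerging_p_value(5, 100, 15, 100)` is below 0.05 for these made-up counts.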
10. Project Outcomes
Slide 10
Created Spotfire Dashboard:
• Pulls data from SQL
• Processes data with R and Python
• Interactive display
11. Text Mining Extension: Background
Slide 11
Reliability manually classifies SRs into ~30 categories
• Each SR takes about 1 min
• Classifying SRs related to XL Immersion
• 13,063 classified SRs to date
Objective: To create and train a model that predicts the category for a given SR.
12. Text Mining Extension: Methodology
Slide 12
Methodology
• Count term usage
• TF-IDF: Term frequency – inverse document frequency
• Train an SVM classifier against pre-categorized SRs
Achieved 75% accuracy using a training set of 12,000 SRs and a test set of 1,000 SRs
Example:
Document 1: "This is an example document. This document means something"
Document 2: "This second document represents something else"
Term counts:
[1, 2, 0, 1, 1, 1, 0, 0, 1, 2]
[0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
TF-IDF weights:
[0.34, 0.48, 0., 0.34, …]
[0., 0.33, 0.47, 0., …]
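The count and TF-IDF vectors for the two example documents can be reproduced with the short Python sketch below. The slides do not state the exact weighting formula, so the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1 followed by L2 normalization is an assumption; it was chosen because, with an alphabetically sorted vocabulary, it matches the example numbers to within about 0.01.

```python
import math
import re
from collections import Counter

docs = [
    "This is an example document. This document means something",
    "This second document represents something else",
]


def tokenize(text):
    """Lower-case the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_vectors(documents):
    """Return the sorted vocabulary and one term-count row per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[term] for term in vocab] for c in counts]


def tfidf(count_rows):
    """Weight counts by smoothed IDF, then scale each row to unit length."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        result.append([x / norm for x in weighted])
    return result


vocab, counts = count_vectors(docs)
weights = tfidf(counts)
```

Here `vocab` starts with an, document, else, example, so the first entries of `weights[0]` are roughly 0.34, 0.49, 0, 0.34 and those of `weights[1]` are roughly 0, 0.33, 0.47, 0. Rows like these are the feature vectors on which a classifier such as an SVM would be trained.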
Editor's Notes
MO EFFICIENCY ISSUES
After stemming, EFFICIENCY becomes EFFICI
The model's usefulness depends in large part on replacing similar words and removing uninformative words
The model's success depends largely on a strong list of filter words