Educational Background
Slide 2
Education:
• Bachelor of Science in Computer Science
• Georgia Institute of Technology
• Expected graduation date May 2019
• Big Data Club: entity tagging news sources
Project Goals
Slide 3
Objective:
To build a text mining model which indicates when the rate of top
keywords changes or when a new keyword emerges.
Background:
• A Service Request (SR) is generated whenever an FSE works on a laser
• Some SRs do not involve a part replacement
• An SR’s main free-text fields are: Customer Description, Problem
Found, Task Description, and Resolution
Project Goal
Slide 4
[Diagram labels: SR, EWI, Analysis, Text Mining, Part Replacement, Automated Monitoring]
Data Pipeline
Slide 5
User filters which SRs to process
Extract SR’s text
• Customer Description
• Problem Found
• Task Description
Tokenize, group, and stem text
• MO PRA -> MO_PRA
• OPTIMIZING, OPTIMIZED, OPTIMIZES -> OPTIMIZ
Replace similar words
• NEON, NE -> NE
• SOFTWARE, SW, SOFT WARE -> SW
Remove weak words
• Dates
• Numbers
• Stopwords: AND, IS, BUT
• Selected words: DUE, END, GROUP
Save SR number with terms
Calculate frequency
Display in Spotfire
Pre-Processing
Slide 6
Tokenize
• Example: “DOSE ERROR COMMUNICATION …”
• Result: [“DOSE”, “ERROR”, “COMMUNICATION”, …]
Group
• Some words mean more as a group
• [“DOSE_ERROR”, “ERROR_COMMUNICATION”…]
Stem
• Many words mean roughly the same thing
• Optimizing, optimized, optimal, optimize all become optimiz
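The three pre-processing steps can be sketched roughly as follows. This is a minimal illustration, not the script used in the project: the function names, the suffix list, and the `min_stem` threshold are all hypothetical, and a real stemmer has far more rules than the toy one shown here.

```python
import re

def tokenize(text):
    """Split raw SR text into upper-case word tokens."""
    return re.findall(r"[A-Z0-9]+", text.upper())

def group(tokens):
    """Join each pair of adjacent tokens into one grouped term."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

# Hypothetical suffix list, checked in order; chosen only for illustration.
SUFFIXES = ("ING", "ED", "ES", "E", "S")

def stem(token, min_stem=4):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token

tokens = tokenize("DOSE ERROR COMMUNICATION")
print(tokens)         # ['DOSE', 'ERROR', 'COMMUNICATION']
print(group(tokens))  # ['DOSE_ERROR', 'ERROR_COMMUNICATION']
print([stem(w) for w in ["OPTIMIZING", "OPTIMIZED", "OPTIMIZES", "OPTIMIZE"]])
# ['OPTIMIZ', 'OPTIMIZ', 'OPTIMIZ', 'OPTIMIZ']
```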
© 2016 Cymer, LLC
Replace
Slide 7
Stemming doesn’t handle all derivations of a word
• NEON, NE -> NE
• SOFTWARE, SW, SFOT_WARE -> SW
Hand selection of similar words
Deep learning spell correction
• Not all words in SR have a dictionary spelling
• Find similarly used words according to word2vec (Python API)
• Compare spelling according to Levenshtein Distance
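A rough sketch of the replacement step, under stated assumptions. The word2vec lookup is not reproduced here; the `candidates` argument stands in for whatever list of similarly used words it returns. The function names and the `max_distance` threshold are hypothetical. Only the Levenshtein distance itself is the standard dynamic-programming algorithm.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hand-selected replacements taken from the slide.
REPLACEMENTS = {"NEON": "NE", "SOFTWARE": "SW", "SFOT_WARE": "SW"}

def replace_similar(tokens):
    """Map each token to its canonical form, leaving unknown tokens alone."""
    return [REPLACEMENTS.get(t, t) for t in tokens]

def likely_misspellings(word, candidates, max_distance=2):
    """Keep candidates (assumed to come from word2vec) whose spelling is
    within max_distance edits of word."""
    return [c for c in candidates
            if c != word and levenshtein(word, c) <= max_distance]

print(replace_similar(["NEON", "LEAK", "SOFTWARE"]))  # ['NE', 'LEAK', 'SW']
print(likely_misspellings("SOFTWARE", ["SFOTWARE", "SOFTWRE", "FIRMWARE"]))
# ['SFOTWARE', 'SOFTWRE']
```

Filtering on both usage similarity and spelling distance keeps out words that are used alike but are genuinely different terms, such as FIRMWARE in the example.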
Remove
Slide 8
Not all text adds meaning to the analysis
• Dates
• Numbers
• Stopwords
• Regex
Hand-selected words that should be removed: GROUP, END
Words kept only as part of a pair: INCREASE, MO
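The removal rules might look something like the sketch below. The word lists come from the slides, but the two regular expressions are hypothetical examples of date and number patterns, and `remove_weak` is an illustrative name. The sketch assumes grouping has already happened, so a pair such as MO_PRA survives while a stand-alone MO is dropped.

```python
import re

STOPWORDS = {"AND", "IS", "BUT"}          # stopword examples from the slides
REMOVE_ALWAYS = {"GROUP", "END", "DUE"}   # hand-selected words
PAIR_ONLY = {"INCREASE", "MO"}            # dropped when they stand alone

# Hypothetical patterns: dates like 2016-07-15 or 7/15/16, and plain numbers.
DATE_RE = re.compile(r"^\d{1,4}[-/]\d{1,2}[-/]\d{1,4}$")
NUMBER_RE = re.compile(r"^\d+(\.\d+)?$")

def remove_weak(terms):
    """Drop dates, numbers, stopwords, and hand-selected weak words."""
    kept = []
    for term in terms:
        if DATE_RE.match(term) or NUMBER_RE.match(term):
            continue
        if term in STOPWORDS or term in REMOVE_ALWAYS or term in PAIR_ONLY:
            continue
        kept.append(term)
    return kept

terms = ["MO", "MO_PRA", "IS", "12", "2016-07-15", "GROUP", "DOSE_ERROR", "3.5"]
print(remove_weak(terms))  # ['MO_PRA', 'DOSE_ERROR']
```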
Methodology
Slide 9
Recurring Keywords:
• Python script embedded in Spotfire
• Each word is stored once for overall usage and once for each month it appears in
• Each word maps to the set of unique SRs it is used in
• Total and monthly SR counts are kept
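One way to hold these structures is sketched below. The record layout `(sr_number, month, terms)`, the function names, and the sample data are all hypothetical. Sets are used so that a word repeated within one SR is counted once.

```python
from collections import defaultdict

def build_index(records):
    """records: iterable of (sr_number, month, terms) tuples."""
    overall = defaultdict(set)       # term -> SR numbers across all months
    monthly = defaultdict(set)       # (term, month) -> SR numbers in that month
    srs_total = set()                # every SR seen
    srs_by_month = defaultdict(set)  # month -> SRs seen in that month
    for sr_number, month, terms in records:
        srs_total.add(sr_number)
        srs_by_month[month].add(sr_number)
        for term in terms:
            overall[term].add(sr_number)
            monthly[(term, month)].add(sr_number)
    return overall, monthly, srs_total, srs_by_month

def monthly_rate(term, month, monthly, srs_by_month):
    """Fraction of a month's SRs that mention the term."""
    n_srs = len(srs_by_month.get(month, ()))
    if n_srs == 0:
        return 0.0
    return len(monthly.get((term, month), ())) / n_srs

# Made-up sample records for illustration only.
records = [
    (101, "2016-06", ["DOSE_ERROR", "SW"]),
    (102, "2016-06", ["SW", "SW"]),
    (103, "2016-07", ["DOSE_ERROR"]),
]
overall, monthly, srs_total, srs_by_month = build_index(records)
print(sorted(overall["SW"]))                                 # [101, 102]
print(monthly_rate("SW", "2016-06", monthly, srs_by_month))  # 1.0
```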
Emerging Trends:
• R script embedded in Spotfire
• Hypergeometric test compares the most recent two months
• Same statistical test used for EWI
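The slides do not spell out how the hypergeometric test is set up, so the following is only one plausible formulation. It pools the SRs from the two most recent months and asks how likely it is that the latest month would contain at least as many mentions of a term by chance. The project ran this step in R; Python is used here only for illustration, and the function names and counts are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n items are drawn without replacement from a
    population of N items, K of which are 'successes'."""
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

def emerging_p_value(hits_prev, srs_prev, hits_recent, srs_recent):
    """p-value for seeing at least hits_recent mentions in the recent month,
    given the pooled mention count over both months."""
    return hypergeom_upper_tail(
        k=hits_recent,
        N=srs_prev + srs_recent,
        K=hits_prev + hits_recent,
        n=srs_recent,
    )

# Made-up counts: 5 of 100 SRs last month, 15 of 100 SRs this month.
p = emerging_p_value(hits_prev=5, srs_prev=100, hits_recent=15, srs_recent=100)
print(p < 0.05)  # True
```

A small p-value suggests the term is being used more often than the previous month's rate would explain, which is the signal an emerging-trend flag needs.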
Project Outcomes
Slide 10
Created Spotfire Dashboard:
• Pulls data from SQL
• Processes data with R and Python
• Interactive display
[Dashboard figure; labels: SR, Script]
Text Mining Extension: Background
Slide 11
Reliability manually classifies SRs into ~30 categories
• Each SR takes about 1 min
• Classifying SRs related to XL Immersion
• 13,063 classified SRs to date
Objective: To create and train a model that predicts the category for a given
SR.
Text Mining Extension: Methodology
Slide 12
Methodology
• Count term usage
• TF-IDF: Term frequency – inverse document frequency
• Train an SVM classifier against pre-categorized SRs
Achieved 75% accuracy using a training set of 12,000 SRs and a test set of
1,000 SRs
Example:
Document 1: “This is an example document. This document means something”
Document 2: “This second document represents something else”
Term counts:
[1, 2, 0, 1, 1, 1, 0, 0, 1, 2]
[0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
TF-IDF:
[ 0.34, 0.48, 0. , 0.34 …]
[ 0. , 0.33, 0.47, 0. …]
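The counting and weighting steps can be sketched with the two example documents. With an alphabetically sorted vocabulary, the counts below reproduce the term-count vectors on the slide. The TF-IDF variant is an assumption (smoothed IDF followed by L2 normalization, a common convention); it gives values close to the ones shown, but the slides do not state which variant was used. The SVM training step is not shown.

```python
import math
import re
from collections import Counter

docs = [
    "This is an example document. This document means something",
    "This second document represents something else",
]

def tokenize(text):
    """Lower-case the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(d) for d in docs]
vocab = sorted({word for doc in tokenized for word in doc})

def count_vector(tokens):
    """Term counts in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

counts = [count_vector(doc) for doc in tokenized]

def tfidf(count_rows):
    """Assumed variant: idf = ln((1 + n) / (1 + df)) + 1, rows scaled to unit length."""
    n_docs = len(count_rows)
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    rows = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        rows.append([x / norm for x in weighted])
    return rows

weights = tfidf(counts)
print(vocab)
print(counts[0])  # [1, 2, 0, 1, 1, 1, 0, 0, 1, 2]
print(counts[1])  # [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
print([round(x, 2) for x in weights[1][:4]])
```

Words that appear in both documents, such as "document", get a lower IDF than words unique to one document, so the weighting shifts emphasis toward terms that distinguish one SR from another.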
2016 Cymer Intern

Editor's Notes

  • #6 Example: MO EFFICIENCY ISSUES. “Efficiency” becomes “effici”. The usefulness of the model depends in large part on replacing similar words and removing useless words.
  • #8 The success of the model depends largely on a strong list of filter words.