Slide 2
Educational Background
Education:
• Bachelor of Science in Computer Science
• Georgia Institute of Technology
• Expected graduation date May 2019
• Big Data Club: entity tagging news sources
Project Goals
Slide 3
Objective:
To build a text mining model which indicates when the rate of top
keywords changes or when a new keyword emerges.
Background:
• A Service Request (SR) is generated whenever a field service engineer (FSE) works on a laser
• Some SRs do not involve a part replacement
• An SR’s main free-text fields are: Customer Description, Problem Found, Task Description, and Resolution
Project Goal
Slide 4
[Diagram labels: SR; EWI Analysis; EWI; Text Mining; SR; Part Replacement; Automated Monitoring]
Data Pipeline
Slide 5
User filters which SRs to process
Extract SR’s text
• Customer Description
• Problem Found
• Task Description
Tokenize, group, and stem text
• MO PRA -> MO_PRA
• OPTIMIZING, OPTIMIZED, OPTIMIZES -> OPTIMIZ
Replace similar words
• NEON, NE -> NE
• SOFTWARE, SW, SOFT WARE -> SW
Remove weak words
• Dates
• Numbers
• Stopwords: AND, IS, BUT
• Selected words: DUE, END, GROUP
Save SR number with terms
Calculate frequency
Display in Spotfire
Pre-Processing
Tokenize
• Example: “DOSE ERROR COMMUNICATION …”
• Result: [“DOSE”, “ERROR”, “COMMUNICATION”, …]
Group
• Some words mean more as a group
• [“DOSE_ERROR”, “ERROR_COMMUNICATION”…]
Stem
• Many words mean roughly the same thing
• Optimizing, optimized, optimal, optimize all become optimiz
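The three pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration, not the script used in the project: the function names and the suffix list are hypothetical, and a real pipeline would more likely rely on a library stemmer than on this toy suffix stripper.

```python
import re

# Hypothetical suffix list, chosen only so the slide's example works.
SUFFIXES = ("ING", "ED", "ES", "E", "S")


def tokenize(text):
    """Upper-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[A-Z0-9]+", text.upper())


def group(tokens):
    """Join each pair of adjacent tokens into one underscore-joined term."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = tokenize("DOSE ERROR COMMUNICATION")
print(tokens)         # ['DOSE', 'ERROR', 'COMMUNICATION']
print(group(tokens))  # ['DOSE_ERROR', 'ERROR_COMMUNICATION']
print([stem(t) for t in ("OPTIMIZING", "OPTIMIZED", "OPTIMIZES")])
# ['OPTIMIZ', 'OPTIMIZ', 'OPTIMIZ']
```

A toy stemmer like this one does not map OPTIMAL to OPTIMIZ, which is one reason a separate replacement step is still needed.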
© 2016 Cymer, LLC
Replace
Stemming doesn’t handle all derivations of a word
• NEON, NE -> NE
• SOFTWARE, SW, SFOT_WARE -> SW
Hand selection of similar words
Deep learning spell correction
• Not all words in an SR have a dictionary spelling
• Find similarly used words according to word2vec (Python API)
• Compare spelling according to Levenshtein Distance
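The replacement step can be sketched as a hand-built lookup table with a Levenshtein fallback for misspellings. This is only an illustration under stated assumptions: the table, the `normalize` helper, and the distance threshold of 2 are hypothetical, SFOT_WARE is treated here as a misspelled token, and the word2vec similarity check from the slide is left out.

```python
# Hypothetical hand-selected table built from the slide's examples.
SYNONYMS = {"NEON": "NE", "SOFTWARE": "SW", "SOFT_WARE": "SW"}


def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalize(token, max_distance=2):
    """Map a token to its canonical form, tolerating small misspellings."""
    if token in SYNONYMS:
        return SYNONYMS[token]
    for known, canonical in SYNONYMS.items():
        if levenshtein(token, known) <= max_distance:
            return canonical
    return token


print(levenshtein("SFOT_WARE", "SOFT_WARE"))  # 2
print(normalize("SFOT_WARE"))                 # SW
print(normalize("NEON"))                      # NE
```

Edit distance alone produces false matches on short tokens, which is presumably why the slide pairs it with a word2vec check that the two words are used in similar contexts.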
Remove
Slide 8
Not all text adds meaning to the analysis
• Dates
• Numbers
• Stopwords
• Regex
Hand-selected words that should be removed: GROUP, END
Words only to be used in pairs: INCREASE, MO
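A minimal sketch of the removal rules follows. The word sets come from the slides, but the regular expressions (in particular the date format) and the `keep_term` helper are assumptions made for illustration.

```python
import re

STOPWORDS = {"AND", "IS", "BUT"}
REMOVE_WORDS = {"DUE", "END", "GROUP"}  # hand-selected words
PAIR_ONLY = {"INCREASE", "MO"}          # meaningful only inside a pair

# Assumed formats: dates like 2016-07-14 or 7/14/2016, plain numbers.
DATE_RE = re.compile(r"\d{1,4}[/-]\d{1,2}[/-]\d{1,4}")
NUMBER_RE = re.compile(r"\d+(\.\d+)?")


def is_weak(term):
    """True for stopwords, hand-selected words, dates, and numbers."""
    return (term in STOPWORDS
            or term in REMOVE_WORDS
            or DATE_RE.fullmatch(term) is not None
            or NUMBER_RE.fullmatch(term) is not None)


def keep_term(term):
    """Keep grouped terms; drop weak or pair-only single words."""
    if "_" in term:
        return True
    return not (is_weak(term) or term in PAIR_ONLY)


terms = ["MO", "MO_PRA", "PRA", "DUE", "2016-07-14", "42", "AND", "DOSE"]
print([t for t in terms if keep_term(t)])  # ['MO_PRA', 'PRA', 'DOSE']
```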
Methodology
Slide 9
Recurring Keywords:
• Python script embedded in Spotfire
• Each word is stored once for overall usage and once for its given month
• Each word maps to the set of unique SRs in which it is used
• The total and monthly SR counts are kept
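The bookkeeping described above can be sketched with dictionaries of sets. The record layout, the SR numbers, and the `build_index` name are hypothetical; the Spotfire integration is not shown.

```python
from collections import defaultdict


def build_index(records):
    """Index terms by the SRs that use them, overall and per month.

    records: iterable of (sr_number, month, terms) tuples.
    """
    overall = defaultdict(set)        # term -> SR numbers
    monthly = defaultdict(set)        # (term, month) -> SR numbers
    srs_total = set()                 # every SR seen
    srs_by_month = defaultdict(set)   # month -> SR numbers
    for sr, month, terms in records:
        srs_total.add(sr)
        srs_by_month[month].add(sr)
        for term in terms:
            overall[term].add(sr)
            monthly[(term, month)].add(sr)
    return overall, monthly, srs_total, srs_by_month


# Made-up example data.
records = [
    ("SR001", "2016-06", {"DOSE_ERROR", "SW"}),
    ("SR002", "2016-06", {"SW"}),
    ("SR003", "2016-07", {"SW", "NE"}),
]
overall, monthly, srs_total, srs_by_month = build_index(records)
print(len(overall["NE"]) / len(srs_total))  # share of all SRs mentioning NE
print(len(monthly[("SW", "2016-06")]))      # 2
```

Storing sets of SR numbers rather than raw counts means a word repeated within one SR is counted once, and the matching SRs can still be listed for drill-down.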
Emerging Trends:
• R script embedded in Spotfire
• Hypergeometric test compares the most recent two months
• Same statistical test used for EWI
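One way to set up such a test, sketched here in Python for consistency with the other examples even though the slide says the project used R: pool the SRs from the two most recent months, and ask how likely it is that the latest month would contain at least as many mentions of a term as observed if mentions were spread at random. The exact formulation used in the project is not given in the slides, so the framing, the function name, and the counts below are assumptions.

```python
from math import comb


def hypergeom_tail(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are successes."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


# Made-up counts: 100 SRs per month; the term appears in 5 SRs in the
# previous month and in 15 SRs in the latest month.
N = 200  # SRs in both months
K = 20   # SRs mentioning the term in both months
n = 100  # SRs in the latest month
k = 15   # SRs mentioning the term in the latest month

p_value = hypergeom_tail(k, N, K, n)
print(f"p = {p_value:.4f}")
```

A small p-value suggests the term is over-represented in the latest month and may be an emerging keyword.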
Project Outcomes
Slide 10
Created Spotfire Dashboard:
• Pulls data from SQL
• Processes data with R and Python
• Interactive display
[Figure label: SR Script]
Text Mining Extension: Background
Slide 11
The Reliability team manually classifies SRs into ~30 categories
• Each SR takes about 1 min
• Classifying SRs related to XL Immersion
• 13,063 classified SRs to date
Objective: To create and train a model that predicts the category for a given SR.
Text Mining Extension: Methodology
Slide 12
Methodology
• Count term usage
• TF-IDF: Term frequency – inverse document frequency
• Train an SVM classifier against pre-categorized SRs
Achieved 75% accuracy using a training set of 12,000 SRs and a test set of 1,000 SRs
Example:
• Document 1: “This is an example document. This document means something”
• Document 2: “This second document represents something else”
Term counts:
[1, 2, 0, 1, 1, 1, 0, 0, 1, 2]
[0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
TF-IDF weights:
[0.34, 0.48, 0., 0.34 …]
[0., 0.33, 0.47, 0. …]
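The example vectors can be reproduced with a short pure-Python sketch. The count vectors match an alphabetically sorted vocabulary, and the weights appear consistent with a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is the default behaviour of scikit-learn's TfidfVectorizer. Whether the project used that library is an assumption, and the SVM training step is not shown here.

```python
import math
import re

DOCS = [
    "This is an example document. This document means something",
    "This second document represents something else",
]


def tokenize(text):
    """Lower-case and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def count_vectors(docs):
    """Return the sorted vocabulary and one term-count row per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    counts = [[doc.count(term) for term in vocab] for doc in tokenized]
    return vocab, counts


def tfidf_vectors(counts):
    """Weight counts by smoothed IDF, then L2-normalize each row."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    vectors = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        vectors.append([x / norm for x in weighted])
    return vectors


vocab, counts = count_vectors(DOCS)
tfidf = tfidf_vectors(counts)
print(vocab)
print(counts[0])  # [1, 2, 0, 1, 1, 1, 0, 0, 1, 2]
print(counts[1])  # [0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
print([round(x, 2) for x in tfidf[1][:4]])
```

Terms that occur in both documents (document, something, this) get the minimum IDF of 1, so words that appear in only one document carry relatively more weight after normalization.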
2016 Cymer Intern


Editor's Notes

  1. Example: “MO EFFICIENCY ISSUES”; stemming turns EFFICIENCY into EFFICI. The model's usefulness depends in large part on replacing similar words and removing useless words.
  2. The model's success depends largely on strong filter words.