2016 Cymer Intern


Akhilesh Aji

Educational Background
Slide 2
Education:
• Bachelor of Science in Computer Science
• Georgia Institute of Technology
• Expected graduation date May 2019
• Big Data Club: entity tagging news sources

Project Goals
Slide 3
Objective:
To build a text mining model which indicates when the rate of top
keywords changes or when a new keyword emerges.
Background:
• A Service Request (SR) is generated whenever a field service engineer (FSE) works on a laser
• Some SRs do not involve a part replacement
• An SR’s main free-text fields are: Customer Description, Problem Found, Task Description, and Resolution

Project Goal
Slide 4
[Diagram; labels: SR, EWI Analysis, EWI, Text Mining, SR, Part Replacement, Automated Monitoring]

Data Pipeline
Slide 5
User filters which SRs to process
Extract SR’s text
• Customer Description
• Problem Found
• Task Description
Tokenize, group, and stem text
• MO PRA -> MO_PRA
• OPTIMIZING, OPTIMIZED, OPTIMIZES -> OPTIMIZ
Replace similar words
• NEON, NE -> NE
• SOFTWARE, SW, SOFT WARE -> SW
Remove weak words
• Dates
• Numbers
• Stopwords: AND, IS, BUT
• Selected words: DUE, END, GROUP
Save SR number with terms
Calculate frequency
Display in Spotfire

Pre-Processing
Tokenize
• Example: “DOSE ERROR COMMUNICATION …”
• Result: [“DOSE”, “ERROR”, “COMMUNICATION”, …]
Group
• Some words mean more as a group
• [“DOSE_ERROR”, “ERROR_COMMUNICATION”…]
Stem
• Many words mean roughly the same thing
• Optimizing, optimized, optimal, optimize all become optimiz
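The three pre-processing steps can be illustrated with a minimal Python sketch. It is not the project's code: the suffix list, the `min_stem` parameter, and the function names are assumptions made for illustration, and a real pipeline would more likely rely on an established stemming library than on this naive suffix stripping.

```python
import re

# Hypothetical suffix list for illustration only; an established stemmer
# would normally replace this naive rule.
SUFFIXES = ("ING", "ED", "ES", "E", "S")


def tokenize(text):
    """Split free text into upper-case word tokens."""
    return re.findall(r"[A-Z0-9]+", text.upper())


def group(tokens):
    """Join each pair of adjacent tokens into a single two-word term."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def stem(token, min_stem=4):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


tokens = tokenize("DOSE ERROR COMMUNICATION")
pairs = group(tokens)
stems = {stem(w) for w in ("OPTIMIZING", "OPTIMIZED", "OPTIMIZES", "OPTIMIZE")}
```

Here `tokens` holds the three words, `pairs` holds the two adjacent-word terms, and all four verb forms collapse to one stem. This simple rule would not map a form such as OPTIMAL to the same stem, which is one reason a separate replacement step is useful.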
© 2016 Cymer, LLC

Replace
Stemming doesn’t handle all derivations of a word
• NEON, NE -> NE
• SOFTWARE, SW, SFOT_WARE -> SW
Hand selection of similar words
Deep learning spell correction
• Not all words in SR have a dictionary spelling
• Find similarly used words according to word2vec (Python API)
• Compare spelling according to Levenshtein Distance
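A rough Python sketch of the spelling comparison follows. The Levenshtein distance is the standard dynamic-programming form. Everything else is an assumption for illustration: the neighbour list stands in for the result of a word2vec similarity query (not run here), the `max_distance` threshold of 2 is arbitrary, and the lookup table simply encodes the hand-selected examples from the slides.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def spelling_variants(word, neighbours, max_distance=2):
    """Keep the neighbours whose spelling is within max_distance edits of word."""
    return [n for n in neighbours if levenshtein(word, n) <= max_distance]


# Hypothetical neighbour list standing in for a word2vec similarity query;
# these words are made up for illustration.
neighbours = ["SOFTWAER", "SFOTWARE", "FIRMWARE", "CONTROLLER"]
variants = spelling_variants("SOFTWARE", neighbours)

# Hand-selected replacements, applied as a simple lookup table.
REPLACEMENTS = {"NEON": "NE", "SOFTWARE": "SW", "SOFT_WARE": "SW"}
normalized = [REPLACEMENTS.get(t, t) for t in ["NEON", "SW", "SOFT_WARE", "DOSE"]]
```

Combining the two signals is the point of the design: word2vec proposes words used in similar contexts, and the edit distance keeps only those that also look like misspellings rather than different words with related meaning.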

Remove
Slide 8
Not all text adds meaning to the analysis
• Dates
• Numbers
• Stopwords
• Regex
Hand-selected words that should be removed: GROUP, END
Words only to be used in pairs: INCREASE, MO
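The removal rules might look like the following Python sketch. The regular expressions and the stopword set are assumptions for illustration, since the slide does not give the actual patterns; the removed and pair-only words are the examples named in the slides.

```python
import re

# Assumed patterns and word lists for illustration; the project's actual
# regexes and stopword list are not shown in the slides.
DATE = re.compile(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}")
NUMBER = re.compile(r"\d+(\.\d+)?")
STOPWORDS = {"AND", "IS", "BUT"}
REMOVED = {"DUE", "END", "GROUP"}
PAIR_ONLY = {"INCREASE", "MO"}  # kept only inside grouped terms such as MO_PRA


def is_weak(term):
    """Return True if the term should be dropped from the analysis."""
    if DATE.fullmatch(term) or NUMBER.fullmatch(term):
        return True
    return term in STOPWORDS or term in REMOVED or term in PAIR_ONLY


terms = ["MO", "MO_PRA", "INCREASE", "IS", "12/03/2016", "42", "END", "DOSE_ERROR"]
kept = [t for t in terms if not is_weak(t)]
```

Only the grouped terms MO_PRA and DOSE_ERROR survive in this example: the standalone MO is dropped, while the pair that contains it is kept.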

Methodology
Slide 9
Recurring Keywords:
• Python script embedded in Spotfire
• Each word stored once for overall usage and once for its given month
• Word maps to a unique set of SRs that the word is used in
• Total and monthly SR counts are kept
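One way to hold this bookkeeping is sketched below in plain Python. The record layout, SR numbers, and month strings are hypothetical, and the sketch does not use the Spotfire scripting environment.

```python
from collections import defaultdict

# Hypothetical records: (SR number, month, terms after pre-processing).
records = [
    ("SR001", "2016-06", ["DOSE_ERROR", "NE"]),
    ("SR002", "2016-06", ["SW", "NE", "NE"]),
    ("SR003", "2016-07", ["DOSE_ERROR", "SW"]),
]

overall = defaultdict(set)        # term -> SRs that use it, all months
by_month = defaultdict(set)       # (term, month) -> SRs that use it that month
srs_per_month = defaultdict(set)  # month -> all SRs seen that month

for sr_number, month, terms in records:
    srs_per_month[month].add(sr_number)
    for term in terms:
        # Sets count each SR once per term, even if the term repeats in the SR.
        overall[term].add(sr_number)
        by_month[(term, month)].add(sr_number)

total_srs = len({sr for sr, _, _ in records})
rate_ne_june = len(by_month[("NE", "2016-06")]) / len(srs_per_month["2016-06"])
```

Storing sets of SR numbers rather than raw counts means a term's frequency is the share of SRs that mention it, so one long SR that repeats a word does not inflate the rate.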
Emerging Trends:
• R script embedded in Spotfire
• Hypergeometric test compares the most recent two months
• Same statistical test used for EWI
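A hypergeometric comparison of two months can be sketched as follows. This is a Python illustration, not the R script from the project. The framing is an assumption: the SRs of both months form the population, the SRs mentioning the term are the "successes", and the latest month is treated as the sample. The counts and the 0.05 threshold are made up.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 50 SRs last month (2 mention the term),
# 40 SRs this month (9 mention the term).
previous_total, previous_hits = 50, 2
latest_total, latest_hits = 40, 9

p_value = hypergeom_upper_tail(
    k=latest_hits,
    N=previous_total + latest_total,
    K=previous_hits + latest_hits,
    n=latest_total,
)
is_emerging = p_value < 0.05  # assumed significance threshold
```

A small `p_value` says that, if mentions were spread evenly across both months, seeing this many in the latest month would be unlikely, which is what flags a term as emerging.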

Project Outcomes
Slide 10
Created Spotfire Dashboard:
• Pulls data from SQL
• Processes data with R and Python
• Interactive display
[Figure; labels: SR, Script]

Text Mining Extension: Background
Slide 11
Reliability manually classifies SRs into ~30 categories
• Each SR takes about 1 min
• Classifying SRs related to XL Immersion
• 13,063 classified SRs to date
Objective: To create and train a model that predicts the category for a given SR.

Text Mining Extension: Methodology
Slide 12
Methodology
• Count term usage
• TF-IDF: Term frequency – inverse document frequency
• Train an SVM classifier against pre-categorized SRs
Achieved 75% accuracy using a training set of 12,000 SRs and a test set of 1,000 SRs
Example:
Document 1: “This is an example document. This document means something”
Document 2: “This second document represents something else”
Term counts:
[1, 2, 0, 1, 1, 1, 0, 0, 1, 2]
[0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
TF-IDF weights:
[ 0.34, 0.48, 0. , 0.34 …]
[ 0. , 0.33, 0.47, 0. …]
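The count and TF-IDF vectors for the two example documents can be reproduced approximately with the standard-library sketch below. The weighting is an assumption: a smoothed IDF of ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, with terms in alphabetical order. The slides do not say which variant was used, and the SVM training step is not shown.

```python
import math
import re
from collections import Counter

docs = [
    "This is an example document. This document means something",
    "This second document represents something else",
]

tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
vocabulary = sorted({t for doc in tokenized for t in doc})

# Raw term counts, one row per document, columns in alphabetical term order.
counts = []
for doc in tokenized:
    tally = Counter(doc)
    counts.append([tally[term] for term in vocabulary])

# Assumed weighting: smoothed IDF, then L2 normalization of each row.
# Other TF-IDF variants give different numbers.
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def tfidf_row(row):
    """Weight raw counts by IDF and scale the row to unit length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted]


tfidf = [tfidf_row(row) for row in counts]
```

Under these assumptions `counts` matches the two integer vectors in the example, and the leading TF-IDF weights agree with the example to within about 0.01. Terms shared by both documents receive a lower weight than terms unique to one, which is what lets the classifier focus on distinguishing words.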

Editor's Notes

MO EFFICIENCY ISSUES
Efficiency becomes effici
The usefulness of the model depends in large part on replacing similar words and removing useless words
The success of the model is largely based on strong filter words
