Practical Natural Language Processing (NLP)
Overview
1. Intro & overview
2. NLP tasks & tools
3. NLP use case sharing
4. Impact & lessons learned
What is Natural Language Processing? Why do we need it?
Explosive growth of unstructured data
Better understanding of user interaction & behavior
Example NLP use cases
Social media monitoring
Recommendation engine
NLP in finance & credit scoring
Other use cases
Language is complicated…
Buffalo… is a grammatically correct sentence
Two interpretations of the same sentence
I saw a girl with a telescope
Natural language processing (NLP) tasks
• Stemming
• Stopword removal
• Word segmentation
• Part-of-speech tagging
• Named entity recognition
• Abbreviation handling
• Ambiguous term handling
• Word similarity
• Translation
Python packages: spaCy, NLTK
1. Stemmer
Reduce words to their root word
• Affect / Affection / Affections / Affected / Affecting → Affect
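The idea can be sketched with a toy suffix stripper. This is an illustration only, not the Porter algorithm or any library's stemmer; the suffix list and the `toy_stem` name are assumptions made for this example. In practice a package such as NLTK would do this job.

```python
# Toy suffix-stripping stemmer (illustration only, not the Porter algorithm).
# The suffix list is an assumption chosen to cover the slide's example words.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")  # longer suffixes are checked first


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["Affect", "Affection", "Affections", "Affected", "Affecting"]
stems = [toy_stem(w) for w in words]
print(stems)  # every word reduces to "affect"
```

A real stemmer applies many more rules and conditions; this sketch only shows why the five variants collapse into one token.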
2. Remove stopwords
Remove common words and words that carry little meaning
• I, am, it, she, he, want, do, …
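A minimal sketch of stopword filtering follows. The stopword set starts from the words listed above and adds a few more ("a", "the", "to", "as") as an assumption for the example; real projects usually take the list from a library such as NLTK or spaCy.

```python
# Small hand-written stopword set; the last four entries are assumptions
# added for this example, the rest come from the slide.
STOPWORDS = {"i", "am", "it", "she", "he", "want", "do", "a", "the", "to", "as"}


def remove_stopwords(text: str) -> list[str]:
    """Lowercase, split on whitespace and drop any token in STOPWORDS."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]


kept = remove_stopwords("I want to work as a Java developer")
print(kept)  # ['work', 'java', 'developer']
```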
3. Handle ambiguous terms
Words with multiple potential meanings
● “I love Blackberry” → fruit or mobile phone?
● Java → programming language or Indonesian island?
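One simple way to pick a meaning is to count overlaps between the surrounding words and a set of cue words per sense. The sketch below assumes hand-written cue words and a hypothetical `disambiguate` helper; it is not how any particular library resolves ambiguity.

```python
# Cue words per sense are invented for this illustration.
SENSES = {
    "blackberry": {
        "fruit": {"eat", "jam", "sweet", "pie", "juice"},
        "mobile phone": {"phone", "call", "keyboard", "email", "battery"},
    },
    "java": {
        "programming language": {"code", "developer", "compile", "software"},
        "Indonesian island": {"island", "indonesia", "travel", "volcano"},
    },
}


def disambiguate(term: str, sentence: str):
    """Return the sense whose cue words overlap most with the sentence,
    or None when the context gives no evidence either way."""
    context = set(sentence.lower().replace(".", "").split())
    scores = {
        sense: len(cues & context)
        for sense, cues in SENSES[term.lower()].items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("Blackberry", "I love Blackberry jam on toast"))
print(disambiguate("Java", "We need a Java developer to write software"))
print(disambiguate("Blackberry", "I love Blackberry"))  # no context: None
```

The last call shows why the slide's sentence is hard: without extra context there is nothing to decide on.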
4. Handle abbreviations
Shortened forms of a word or phrase
● HDB / MIC / NTUC / PM / CV
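A dictionary lookup is the simplest way to expand abbreviations. In the sketch below the two expansions are assumptions that suit a job-post context ("PM" could equally mean something else in another domain), and abbreviations missing from the dictionary are left untouched.

```python
import re

# Assumed expansions for a job-post context; the right mapping depends on
# the domain and must be curated by someone who knows the data.
ABBREVIATIONS = {"CV": "curriculum vitae", "PM": "project manager"}


def expand_abbreviations(text: str) -> str:
    """Replace known all-caps tokens (2 to 5 letters) with their long form."""
    def replace(match: re.Match) -> str:
        token = match.group(0)
        return ABBREVIATIONS.get(token, token)  # unknown tokens pass through

    return re.sub(r"\b[A-Z]{2,5}\b", replace, text)


expanded = expand_abbreviations("Send your CV to the PM")
print(expanded)  # Send your curriculum vitae to the project manager
```

Matching only upper-case tokens avoids rewriting ordinary words such as "pm" in "3 pm".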
5. Named entity recognition (NER)
Python packages: spaCy, NLTK
6. word2vec (word embedding)
e.g. semantic similarity: which words have a similar meaning to a given word or phrase?
Python packages: gensim, TensorFlow
Search: Java developer (which words are semantically similar?)
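Semantic similarity between embeddings is typically measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; a trained word2vec model (for example from gensim) would supply vectors with hundreds of dimensions learned from the corpus.

```python
import math

# Made-up 3-d vectors for illustration only; real embeddings are learned.
VECTORS = {
    "java": [0.9, 0.1, 0.0],
    "python": [0.8, 0.2, 0.1],
    "developer": [0.7, 0.3, 0.1],
    "nurse": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, topn=2):
    """Rank every other word in VECTORS by cosine similarity to `word`."""
    query = VECTORS[word]
    scored = [(w, cosine(query, vec)) for w, vec in VECTORS.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


neighbours = most_similar("java")
print([w for w, _ in neighbours])  # ['python', 'developer']
```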
7. Wordcloud
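A word cloud is drawn from word frequencies, so the counting step can be sketched with the standard library. The sample job titles are invented for the example; rendering the actual image would need a plotting package and is not shown here.

```python
from collections import Counter

# Invented sample job titles, for illustration only.
posts = ["Java developer", "Senior Java developer", "Data engineer"]

# Count how often each lowercased word appears across all posts.
freq = Counter(word for post in posts for word in post.lower().split())

top = freq.most_common(2)  # the largest words in the cloud
print(top)
```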
Other challenges in text processing
● Language libraries → Malay and Tamil are not yet supported by most service providers
● Spelling mistakes
● Language translation and mapping
● Contextual meaning
● Informal language handling
● Sarcasm
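Of the challenges listed, spelling mistakes are the easiest to illustrate. A minimal sketch, assuming a small known vocabulary and using fuzzy matching from Python's standard `difflib` module; the vocabulary, the `correct` helper and the 0.8 cutoff are all assumptions for the example.

```python
import difflib

# Assumed vocabulary of known job-title words.
VOCABULARY = ["developer", "engineer", "designer", "accountant"]


def correct(word: str, cutoff: float = 0.8) -> str:
    """Return the closest vocabulary word, or the input if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(correct("develper"))  # developer
print(correct("enginer"))   # engineer
print(correct("nurse"))     # nurse (no close match, left unchanged)
```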
Use case:
Build a job classification engine using NLP
(with human-in-the-loop design)
1a) Business objective
To understand job market demand & supply, location, and the
skill sets needed for each profession & role
1b) Overview
2) Key information extraction
1) Classification
Classifier 1 – MASCO job category
Classifier 2 – MSIC industry
Classifier 3 – NEC field of study
Classifier 4 - SKILL library
• Classify a job post into a MASCO job category (6000+ categories!)
• Classify a job post into an MSIC industry category (300+ categories)
• Classify a job post into an NEC field of study category (100+ categories)
• Extract relevant skills that match the SKILL library (2000+ categories)
6000+ MASCO categories
1c) Problem framing and solution approach
1. Business objective: classify a job post into a MASCO job category (a text and language classification problem)
2. Input data: job title + job description (very sparse, text-based data)
3. Output categories: 6000+ MASCO categories (even the Google NLP API can only categorize into 300+)
4. Selected algorithm/model: word2vec semantic similarity
5. Alternative models: custom deep learning models such as BERT or LSTM (but wait… it takes 3~4 months to fine-tune, and then you pray hard for the accuracy!)
6. Other challenges: limited time (< 2 months) and resources (1 data scientist, 1 data engineer)
Let’s test Google’s world-leading NLP API
2a) Data preparation & standardization
3a) Technology architecture overview
3b) Data engineering & workflow
4a) Text preprocessing
4b) Information extraction
4c) Train word2vec model
4d) TensorBoard to visualize the word2vec model
5a) Classification (exact match)
MASCO job category
NEC field of study
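Exact-match classification can be sketched as a lookup on a normalised job title. The category labels below are placeholders, not real MASCO or NEC codes, and the normalisation rule is an assumption for the example.

```python
import re

# Placeholder labels, NOT real MASCO codes; a real table would be loaded
# from the official category list.
CATEGORY_BY_TITLE = {
    "software developer": "CAT-001",
    "staff nurse": "CAT-002",
}


def normalise(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def exact_match(title: str):
    """Return the category for a title, or None when there is no exact hit."""
    return CATEGORY_BY_TITLE.get(normalise(title))


print(exact_match("Software  Developer!!"))  # CAT-001
print(exact_match("Java ninja"))             # None: falls through to semantic match
```

Titles that miss here are the ones a semantic-match step has to handle.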
5b) Classification (semantic match)
MASCO job category
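A semantic match can be sketched by averaging the word vectors of a job title and choosing the category whose averaged vector is closest by cosine similarity. This is a sketch under assumptions, not the project's actual pipeline: the vectors are made up, the two categories are placeholders, and averaging is only one of several ways to embed a phrase.

```python
import math

# Made-up vectors; a trained word2vec model would supply real ones.
WORD_VECTORS = {
    "java": [0.9, 0.1, 0.0],
    "software": [0.8, 0.2, 0.0],
    "developer": [0.7, 0.3, 0.1],
    "programmer": [0.8, 0.3, 0.0],
    "nurse": [0.0, 0.1, 0.9],
    "staff": [0.1, 0.4, 0.6],
}
CATEGORIES = ["software developer", "staff nurse"]  # placeholder categories


def embed(text):
    """Average the vectors of known words; None if no word is known."""
    vecs = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vecs:
        return None
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def semantic_match(title):
    """Return (best category, similarity score) for a job title."""
    query = embed(title)
    if query is None:
        return None, 0.0
    scored = [(cat, cosine(query, embed(cat))) for cat in CATEGORIES]
    return max(scored, key=lambda pair: pair[1])


category, score = semantic_match("Java programmer")
print(category, round(score, 2))
```

Note that "Java programmer" shares no word with "software developer", yet the vectors still place it there, which is what exact matching cannot do.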
6a) Evaluation (human-in-the-loop design)
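One common way to put a human in the loop is to auto-accept confident predictions and queue the rest for manual review. The sketch below assumes each prediction carries a similarity score; the 0.8 threshold, the job IDs and the `route` helper are hypothetical and not taken from the project.

```python
REVIEW_THRESHOLD = 0.8  # hypothetical cut-off; would be tuned on reviewed data


def route(predictions, threshold=REVIEW_THRESHOLD):
    """Split (job_id, category, score) tuples into auto-accepted and
    human-review lists based on the score."""
    auto, review = [], []
    for job_id, category, score in predictions:
        target = auto if score >= threshold else review
        target.append((job_id, category, score))
    return auto, review


# Invented predictions for illustration.
predictions = [
    ("job-1", "software developer", 0.95),
    ("job-2", "staff nurse", 0.55),
    ("job-3", "software developer", 0.80),
]
auto, review = route(predictions)
print(len(auto), "auto-classified;", len(review), "sent to a human reviewer")
```

Corrections made by reviewers can then be fed back to extend the exact-match table or retrain the embedding model.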
7) Impact & Benefit
1) Faster discovery of job market insights & trends
Data-to-decision time improved from months to days
2) Automate 90% of manual work
50,000+ job posts auto-classified per month
300+ man-hours saved per month (2 head-count)
8a) Business challenges & lessons learned
● Get buy-in early - Identify all your stakeholders and involve them
from project initiation (avoid “I build first, they will come later”)
● Goal & impact oriented - Understand what matters to the
organization & identify high-impact, low-hanging-fruit use cases
● Be Agile – Start with a simple model, build a first prototype, get
feedback and iterate
8b) Technical challenges & lessons learned
● Be realistic – Data does not always come in the “perfect” structure
on your wish list!
● Technology/technical gaps - In your organization, e.g. legacy
systems & integration
● Performance - Architecture & pipelines must cope with load, e.g.
a growing number of concurrent users
● End-to-end solution mindset - You need knowledge of software
infrastructure, development pipelines & deployment!
Questions?
Thank you
lutherteh0204@gmail.com
NLP & NLU
