Abstract: It is common for government and public datasets to include narrative fields, as in inspection reports, incident reports, surveys, 911 calls, fire responses, etc. In addition to categorical fields such as datetime, location, and demographics, these datasets tend to include a narrative description (e.g., what happened). The narrative field is typically where the most interesting data resides for the purpose of classification. The problem is that, since the narrative is interpreted and entered by a human, each entry may be unique; if we use the whole entry as a single value, we end up with an overfitted model that works only on the training data.
In this presentation, I will cover how natural language processing techniques are used to convert narrative fields into categorical data.
Level: Intermediate
Requirements: One should know basics of linear regression models. No prior programming knowledge is required.
Natural Language Processing - Handling Narrative Fields in Datasets for Classification
1. Handling Narrative Fields in Datasets for Classification
Portland Data Science Group
Created by Andrew Ferlitsch
Community Outreach Officer
May, 2017
3. Feature Reduction
• Filter out Garbage (dirty data)
• Filter out Noise (non-relevant features)
• Goal = Low Bias, Low Variance
Data + Noise + Garbage → Relevant Data Only (Information Gain / Reduce Entropy)
4. Dataset with Narrative Fields
Feature 1 Feature 2 Feature 3 Feature 4 Label
real-value real-value narrative categorical-value category
real-value real-value narrative categorical-value category
real-value real-value narrative categorical-value category
Narrative is plain text: a human's description of the entry, i.e., what happened.
“upon arrival, the individual was initially non-responsive. …”
Category (label) is a classification assigned by a human interpreting the narrative.
012 // Code value for “coarse” category
5. Problem with Narrative Text Fields
• Examples: 911 calls, Police/Emergency/Medical incidents, Inspections, Surveys, Complaints, Reviews
– Human Entered
– Human Interpreted => Categorizing
– Different People Entering and Categorizing
– Non-Uniformity
– Human Errors
6. Challenge
• Convert Narrative Fields into Features with Categorical (or preferably Real) Values.
Data + Narrative → Data + Categorical / Real Values
7. Bag of Words
Narrative Field → Bag of Words
• Unordered List of Words
• Convert Unique Words into Categorical Variables
• Set 1 if word appears in narrative; otherwise set 0.
8. Cleansing and Tokenize (Words)
• Remove Punctuation
• Expand Contractions (e.g., isn’t -> is not)
• Lowercase
The quick brown fox jumped over the lazy dog.
the:2
quick:1
brown:1
fox:1
jumped:1
over:1
lazy:1
dog:1
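The cleansing and tokenizing steps above can be sketched in a few lines of Python. The contraction list here is a small illustrative subset, not an exhaustive one:

```python
import string

# Small illustrative subset of English contractions for the expansion step.
CONTRACTIONS = {"isn't": "is not", "don't": "do not", "won't": "will not"}

def tokenize(text):
    """Lowercase, expand contractions, strip punctuation, split into words."""
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

def word_counts(text):
    """Count occurrences of each token, as in the slide's example."""
    counts = {}
    for word in tokenize(text):
        counts[word] = counts.get(word, 0) + 1
    return counts
```

Running `word_counts("The quick brown fox jumped over the lazy dog.")` reproduces the counts on the slide: the:2 and every other word 1.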
9. Narrative as Categorical Variables
The quick brown fox jumped over the lazy dog.
The dog barked while the cat was jumping.
the | quick | brown | fox | jumped | over | lazy | dog | barked | while | cat | was | jumping
Sentence 1: 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
Sentence 2: 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1
Issues: Explosion of categorical variables. For example, if the dataset
has 80,000 unique words, then you would have 80,000 categorical variables!
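A minimal sketch of the encoding. The function names are my own, and the documents are assumed to already be cleansed and tokenized:

```python
def bag_of_words(documents):
    """Build a sorted vocabulary over the corpus, then encode each document
    as a 0/1 vector: 1 if the word appears in it, otherwise 0."""
    vocabulary = sorted({word for doc in documents for word in doc})
    vectors = []
    for doc in documents:
        present = set(doc)
        vectors.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, vectors
```

With 80,000 unique words in the corpus, each vector would have 80,000 entries, which is exactly the explosion the slide warns about.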
10. Corpus
• A collection of related documents.
• The Narratives in the Dataset are the Corpus.
• Each Narrative is a Document
Feature 1 .. N | Narrative | Label
(The Narrative column is the CORPUS; each individual narrative cell is a Document.)
11. Word Distribution
• Make a pass through all the narratives (corpus) building a dictionary.
• Sort by Word Frequency (number of times it occurs).
[Plot: words sorted by frequency, from MAX down to 0. Above the Upper Threshold: useless words with no significance (e.g., "the") and commonly used words; below the Lower Threshold: rare words or misspellings.]
12. Stop Word Removal
• Remove Highest Frequency Words (above upper threshold), and
• Remove Lowest Frequency Words (below lower threshold) (optional).
The quick brown fox jumped over the lazy dog.
The dog barked while the cat was jumping.
quick | brown | fox | jumped | lazy | dog | barked | cat | jumping
Sentence 1: 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
Sentence 2: 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1
Well known predefined Stop Word Lists – most widely used is the Porter List
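The two-threshold idea can be sketched as follows. The threshold values are illustrative assumptions (the slide leaves the choice open), and in practice one could instead filter against a predefined list such as the Porter stop words:

```python
def remove_stop_words(documents, upper=0.1, lower=0):
    """Drop words whose share of all corpus tokens exceeds `upper`
    (e.g., 'the'), and optionally words occurring `lower` times or fewer."""
    counts, total = {}, 0
    for doc in documents:
        for word in doc:
            counts[word] = counts.get(word, 0) + 1
            total += 1
    keep = {w for w, c in counts.items() if c / total <= upper and c > lower}
    return [[w for w in doc if w in keep] for doc in documents]
```

The default `lower=0` keeps rare words, matching the slide's note that the low-frequency cut is optional.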
13. Stemming
• Stemming – Reduce words to their root stem.
Ex. Jumped, jumping, jumps => jump
• Does not use predefined dictionary. Uses grammar ending rules.
jumped, jumping → jump; barked → bark
quick | brown | fox | jump | lazy | dog | bark | cat
Sentence 1: 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0
Sentence 2: 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1
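A toy rule-based stemmer illustrating the grammar-ending idea. It assumes only a handful of suffix rules; the real Porter algorithm has many more rules and conditions:

```python
# A few Porter-style suffix rules; NOT the full Porter algorithm.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    """Strip the first matching suffix, leaving at least a 3-letter stem."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word
```

This reproduces the slide's examples: jumped, jumping, jumps → jump and barked → bark.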
14. Lemmatization
• Stemming rules produce the correct stem for regular words, BUT the wrong stem when the word is an exception.
Ex. something => someth
• Lemmatization means reducing words to their root form, but correcting the exceptions by using a dictionary of common exceptions (vs. all words, e.g., 1,000 words instead of 100,000).
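A sketch of the idea: strip a suffix with a grammar rule first, then correct known exceptions by dictionary lookup. The dictionary entries here are illustrative; a real one would hold on the order of 1,000 entries:

```python
# Small illustrative exception dictionary.
EXCEPTIONS = {"someth": "something"}

def lemmatize(word):
    """Apply a grammar ending rule, then correct exceptions by lookup."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: len(word) - len(suffix)]
            break
    return EXCEPTIONS.get(word, word)
```

Regular words still stem normally (jumped → jump), while the slide's exception is repaired: something → someth → something.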
15. Term Frequency (TF)
• Issue: All words are weighted the same.
• Term Frequency weights each word by its frequency in the corpus, using the frequency as its feature value (vs. 1 or 0).
(no. of occurrences in corpus) / (no. of unique words in corpus)
quick | brown | fox | jump | lazy | dog | bark | cat
Sentence 1: 0.001 | 0.003 | 0.0002 | 0.006 | 0.0001 | 0.007 | 0.0001 | 0.007
Sentence 2: 0 | 0 | 0 | 0.006 | 0 | 0.007 | 0.0001 | 0.007
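Computing TF exactly as the slide defines it. Note this is a corpus-level frequency; many other texts define TF per document instead:

```python
def term_frequency(documents):
    """TF per the slide: a word's occurrences in the corpus divided by
    the number of unique words in the corpus."""
    counts = {}
    for doc in documents:
        for word in doc:
            counts[word] = counts.get(word, 0) + 1
    unique = len(counts)
    return {word: count / unique for word, count in counts.items()}
```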
16. Inverse Document Frequency (IDF)
• Issue: TF gives higher weight to words that are the most frequently used – this may result in underfitting (too general).
• Inverse Document Frequency weights words by how rarely they appear in the corpus (the assumption is that a rarer word is more significant in a document).
log ((no. of unique words in corpus) / (no. of occurrences in corpus) )
quick | brown | fox | jump | lazy | dog | bark | cat
Sentence 1: 2 | 1.5 | 2.7 | 1.2 | 3 | 1.15 | 3 | 1.15
Sentence 2: 0 | 0 | 0 | 1.2 | 0 | 1.15 | 3 | 1.15
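The slide's IDF formula as code. Note that the standard IDF in most references is log of total documents over documents containing the word; the corpus-occurrence version here follows the slide's formula:

```python
import math

def inverse_document_frequency(documents):
    """IDF per the slide: log(unique words in corpus /
    word's occurrences in the corpus). Rare words score higher."""
    counts = {}
    for doc in documents:
        for word in doc:
            counts[word] = counts.get(word, 0) + 1
    unique = len(counts)
    return {word: math.log(unique / count) for word, count in counts.items()}
```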
17. Pruning
• Even with Stemming/Lemmatization, the feature matrix will be massive in size (e.g., 30,000 features).
• Reduce to smaller number – typically 500 to 1000.
• Choose the highest TF or IDF values in the Corpus.
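Pruning is then a top-k selection over the weights; the function shape and default k are assumptions within the slide's suggested 500–1000 range:

```python
def prune(weights, k=1000):
    """Keep only the k features with the highest TF (or IDF) weight."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])
```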
18. Advanced Topic – Word Reduction
• Words that are part of a common grouping are replaced
with a root word for the group.
• Steps:
1. Stemming/Lemmatization
2. Lookup Root Word in Word Group Dictionary
3. If entry exists, replace with common root word for
the group.
Group Example: male: [ man, gentleman, boy, guy, dude ]
19. Advanced Topic – Word Reduction
male : [ man, gentleman, boy, guy, dude ]
female: [ woman, lady, girl, gal ]
parent : [ father, mother, mom, mommy, dad, daddy ]
Word       Root
man        male
gentleman  male
boy        male
guy        male
dude       male
woman      female
lady       female
girl       female
gal        female
The mother played with the girls while the dad
prepared snacks for the ladies in mom’s reading group.
→ parent, play, female, parent, prepare, snack, female, parent, read, group
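The lookup-and-replace step, using the slide's groups. Tokens are assumed to already be stemmed/lemmatized (e.g., girls → girl) before the lookup:

```python
# Word groups from the slide, inverted into a word -> root lookup table.
GROUPS = {
    "male": ["man", "gentleman", "boy", "guy", "dude"],
    "female": ["woman", "lady", "girl", "gal"],
    "parent": ["father", "mother", "mom", "mommy", "dad", "daddy"],
}
WORD_TO_ROOT = {word: root for root, words in GROUPS.items() for word in words}

def reduce_words(tokens):
    """Replace each token belonging to a group with the group's root word."""
    return [WORD_TO_ROOT.get(token, token) for token in tokens]
```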
20. Advanced Topic – N-grams
• Instead of parsing the sentence into single words, each as a feature, we group them into pairs (2-grams), triplets (3-grams), etc.
• Parameters:
1. Choose Window Size (2, 3, …)
2. Choose Stride Length (1, 2, …)
[Diagram: a 2-gram window sliding over word1 word2 word3 …, advancing with a stride of 1.]
21. Advanced Topic – N-grams
The quick brown fox jumped over the lazy dog
quick, brown, fox, jump, lazy, dog
2-grams, stride of 1:
quick, brown
brown, fox
fox, jump
jump, lazy
lazy, dog
dog, <null>
quick,brown | brown,fox | fox,jump | jump,lazy | lazy,dog | dog,<null>
1 | 1 | 1 | 1 | 1 | 1
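The windowing above can be sketched as follows; the `<null>` padding of the final gram follows the slide's example:

```python
def ngrams(tokens, window=2, stride=1, pad="<null>"):
    """Slide a window of the given size over the tokens, advancing by
    `stride`, padding the last gram with `pad` as in the slide."""
    grams = []
    for i in range(0, len(tokens), stride):
        gram = tokens[i : i + window]
        gram = gram + [pad] * (window - len(gram))
        grams.append(tuple(gram))
    return grams
```

For the slide's six stemmed tokens this yields six 2-grams, ending with ("dog", "<null>").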
22. More – Not Covered
• Word-Vectors [Word Embedding]
• Correcting Misspellings
• Detecting incorrectly categorized Narratives.
23. Final – Homegrown Tool
• I built a command-line tool for doing all the steps in this presentation.
• Java based, packaged as a JAR file.
https://github.com/andrewferlitsch/Portland-Data-Science-Group/blob/master/README.NLP.md
24. Final – Homegrown Tool - Examples
• Quora question pairs (training set: 400,000)
java -jar nlp.jar -c3,4 train.csv
• Remove Stop Words
java -jar nlp.jar -c3,4 -e p train.csv
• Lemma and Reduce to Common Root
java -jar nlp.jar -c3,4 -e p -l -r train.csv
• Lemma and Reduce to Common Root
java -jar nlp.jar -c3,4 -e p -l -r -F train.csv