Handling Narrative Fields in Datasets for Classification
Portland Data Science Group
Created by Andrew Ferlitsch
Community Outreach Officer
May, 2017
Typical Dataset
Feature 1    Feature 2    Feature 3    Feature 4          Label
real-value   real-value   real-value   categorical-value  category
real-value   real-value   real-value   categorical-value  category
real-value   real-value   real-value   categorical-value  category
Progression in Dataset Preparation: Dataset Clean -> Categorical Value Conversion -> Feature Scaling
Feature Reduction
• Filter out Garbage (dirty data)
• Filter out Noise (non-relevant features)
• Goal = Low Bias, Low Variance
Data + Noise + Garbage -> Relevant Data Only (Information Gain, Reduce Entropy)
Dataset with Narrative Fields
Feature 1    Feature 2    Feature 3    Feature 4          Label
real-value   real-value   narrative    categorical-value  category
real-value   real-value   narrative    categorical-value  category
real-value   real-value   narrative    categorical-value  category
Narrative is a plain-text field: a human description of the entry, i.e., what happened.
Example: “upon arrival, the individual was initially non-responsive. …”
Category (label) is a classification that a human assigns by interpreting the narrative.
Example: 012 // Code value for “coarse” category
Problem with Narrative Text Fields
• Examples: 911 calls, Police/Emergency/Medical, Incidents, Inspections, Surveys, Complaints, Reviews
– Human Entered
– Human Interpreted => Categorizing
– Different People Entering and Categorizing
– Non-Uniformity
– Human Errors
Challenge
• Convert Narrative Fields into Features with Categorical (or preferably Real) Values.
Data + Narrative -> Data + Categorical / Real Values
Bag of Words
Narrative Field -> Bag of Words
• Unordered List of Words
• Convert Unique Words into Categorical Variables
• Set 1 if word appears in narrative; otherwise set 0.
Cleansing and Tokenizing (Words)
• Remove Punctuation
• Expand Contractions (e.g., isn’t -> is not)
• Lowercase
The quick brown fox jumped over the lazy dog.
the:2
quick:1
brown:1
fox:1
jumped:1
over:1
lazy:1
dog:1
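The cleansing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's tool: the `CONTRACTIONS` table and the `tokenize` helper are hypothetical names, and the table is deliberately tiny.

```python
import re
from collections import Counter

# Hypothetical, deliberately small contraction table; a real one would be much larger.
CONTRACTIONS = {"isn't": "is not", "can't": "cannot", "won't": "will not"}

def tokenize(narrative):
    """Lowercase, expand contractions, strip punctuation, and split into words."""
    text = narrative.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for short, expanded in CONTRACTIONS.items():
        text = text.replace(short, expanded)
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation becomes whitespace
    return text.split()

tokens = tokenize("The quick brown fox jumped over the lazy dog.")
counts = Counter(tokens)  # word -> number of occurrences in the narrative
print(counts)
```

Lowercasing happens first here so that the contraction lookup only needs lowercase keys; the end result is the same word counts as on the slide.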
Narrative as Categorical Variables
The quick brown fox jumped over the lazy dog.
The dog barked while the cat was jumping.
the  quick  brown  fox  jumped  over  lazy  dog  barked  while  cat  was  jumping
1    1      1      1    1       1     1     1    0       0      0    0    0
1    0      0      0    0       0     0     1    1       1      1    1    1
Issues: Explosion of categorical variables. For example, if the dataset
has 80,000 unique words, then you would have 80,000 categorical variables!
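To make the 0/1 encoding concrete, here is a small sketch that builds one column per unique word and one row per narrative. The function name `bag_of_words` is an assumption for illustration, and the input is assumed to be already cleansed and tokenized.

```python
def bag_of_words(documents):
    """Return (vocabulary, rows): one 0/1 column per unique word, one row per document."""
    # dict.fromkeys keeps first-seen order while dropping duplicates
    vocabulary = list(dict.fromkeys(word for doc in documents for word in doc))
    rows = []
    for doc in documents:
        present = set(doc)
        rows.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, rows

docs = [
    "the quick brown fox jumped over the lazy dog".split(),
    "the dog barked while the cat was jumping".split(),
]
vocabulary, rows = bag_of_words(docs)
for row in rows:
    print(row)
```

With only two short narratives the vocabulary already has 13 columns, which is the explosion the slide warns about.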
Corpus
• A collection of related documents.
• The Narratives in the Dataset are the Corpus.
• Each Narrative is a Document
Figure: dataset columns Feature 1 .. N, Narrative, Label; the Narrative column is the CORPUS and each cell in it is a Document.
Word Distribution
• Make a pass through all the narratives (corpus) building a dictionary.
• Sort by Word Frequency (number of times it occurs).
Word frequency axis, from 0 to MAX:
– Above the Upper Threshold: Useless Words that have no significance (e.g., the)
– Between the Thresholds: Commonly used Words
– Below the Lower Threshold: Rare Words or Misspellings
Stop Word Removal
• Remove Highest Frequency Words (above upper threshold), and
• Remove Lowest Frequency Words (below lower threshold) (optional).
The quick brown fox jumped over the lazy dog.
The dog barked while the cat was jumping.
quick  brown  fox  jumped  lazy  dog  barked  cat  jumping
1      1      1    1       1     1    0       0    0
0      0      0    0       0     1    1       1    1
Well-known predefined Stop Word Lists are also available; a widely used one is the Porter list.
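The threshold-based removal can be sketched as follows. The function name `remove_by_frequency` and the threshold values are assumptions chosen for this tiny example; real thresholds would be tuned on the full corpus.

```python
from collections import Counter

def remove_by_frequency(documents, upper, lower=0):
    """Drop words whose corpus-wide count is above `upper` or below `lower`."""
    counts = Counter(word for doc in documents for word in doc)
    keep = {word for word, count in counts.items() if lower <= count <= upper}
    return [[word for word in doc if word in keep] for doc in documents]

docs = [
    "the quick brown fox jumped over the lazy dog".split(),
    "the dog barked while the cat was jumping".split(),
]
filtered = remove_by_frequency(docs, upper=2)
print(filtered)
```

In a corpus this small only "the" (4 occurrences) crosses the upper threshold; words such as "over", "while" and "was" would typically be caught by a predefined stop word list or by thresholds computed on a much larger corpus.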
Stemming
• Stemming – Reduce words to their root stem.
Ex. Jumped, jumping, jumps => jump
• Does not use a predefined dictionary; uses word-ending (suffix) rules.
jumped, jumping => jump; barked => bark
quick  brown  fox  jump  lazy  dog  bark  cat
1      1      1    1     1     1    0     0
0      0      0    1     0     1    1     1
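A rule-based stemmer can be as simple as stripping known endings. The sketch below is a toy with three hypothetical rules (`SUFFIXES`), far cruder than established stemmers such as Porter's, but it shows the dictionary-free idea.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical, minimal rule set

def stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["jumped", "jumping", "jumps", "barked", "dog"]:
    print(word, "=>", stem(word))
```

Because the rules know nothing about the vocabulary, a word like "something" is also cut down (to "someth"), which is the weakness the next slide addresses.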
Lemmatization
• Stems are correct when the word follows the rules, BUT incorrect when the word is an exception.
Ex. something => someth
• Lemmatization means reducing words to their root form, while correcting the exceptions by using a dictionary of common exceptions (rather than a dictionary of all words, e.g., 1,000 words instead of 100,000).
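Following the slide's description, lemmatization can be sketched as "check the exception dictionary first, otherwise fall back to the stemming rules". The `EXCEPTIONS` entries and the toy `stem` rules below are hypothetical and only illustrate the lookup order.

```python
# Hypothetical exception dictionary: words the suffix rules would get wrong.
EXCEPTIONS = {"something": "something", "went": "go", "children": "child"}

SUFFIXES = ("ing", "ed", "s")  # same toy rules a simple stemmer might use

def stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Use the exception dictionary when the word is listed; otherwise apply the rules."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return stem(word)

print(lemmatize("something"), lemmatize("jumping"), lemmatize("went"))
```

The exception dictionary stays small because regular words never need an entry.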
Term Frequency (TF)
• Issue: All words are weighted the same
• Term Frequency weights each word by how frequently it occurs in the corpus, and uses that frequency as its feature value (vs. 1 or 0).
TF = (no. of occurrences in corpus) / (no. of unique words in corpus)
quick   brown   fox     jump    lazy    dog     bark    cat
0.001   0.003   0.0002  0.006   0.0001  0.007   0.0001  0.007
-       -       -       0.006   -       0.007   0.0001  0.007
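The sketch below computes the weighting exactly as the slide's formula states it (corpus-wide occurrences divided by the number of unique words). Note that many libraries and textbooks instead define term frequency per document, as occurrences in the document divided by the document's length; the function name `corpus_term_frequency` is an assumption for illustration.

```python
from collections import Counter

def corpus_term_frequency(documents):
    """Slide's formula: occurrences in corpus / number of unique words in corpus."""
    counts = Counter(word for doc in documents for word in doc)
    unique_words = len(counts)
    return {word: count / unique_words for word, count in counts.items()}

docs = [
    ["quick", "brown", "fox", "jump", "lazy", "dog"],
    ["jump", "dog", "bark", "cat"],
]
tf = corpus_term_frequency(docs)
# Feature row for the second narrative: the weight if the word is present, else 0.0
row = [tf[word] if word in docs[1] else 0.0 for word in tf]
print(tf)
print(row)
```

The numbers on the slide are illustrative; on this two-sentence corpus the weights are much larger because there are only 8 unique words.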
Inverse Document Frequency (IDF)
• Issue: TF gives higher weight to the most frequently used words, which may result in underfitting (too general).
• Inverse Document Frequency weights words by how rarely they appear in the corpus (the assumption is that a rare word is more significant in the document where it appears).
IDF = log ((no. of unique words in corpus) / (no. of occurrences in corpus))
quick  brown  fox    jump   lazy   dog    bark   cat
2      1.5    2.7    1.2    3      1.15   3      1.15
-      -      -      1.2    -      1.15   3      1.15
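As a sketch, the inverse weighting can be computed directly from the slide's formula. The logarithm base is not stated on the slide, so base 10 is an assumption here, as is the function name. The more common IDF definition is log(number of documents / number of documents containing the word); this example follows the slide instead.

```python
import math
from collections import Counter

def corpus_inverse_frequency(documents):
    """Slide's formula: log(unique words in corpus / occurrences in corpus)."""
    counts = Counter(word for doc in documents for word in doc)
    unique_words = len(counts)
    # Base 10 is an assumption; any base preserves the ranking of words.
    return {word: math.log10(unique_words / count) for word, count in counts.items()}

docs = [
    ["quick", "brown", "fox", "jump", "lazy", "dog"],
    ["jump", "dog", "bark", "cat"],
]
idf = corpus_inverse_frequency(docs)
print(idf)
```

Words that occur once ("quick", "bark") receive a higher weight than words that occur twice ("jump", "dog"), which is the intended effect.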
Pruning
• Even with Stemming/Lemmatization, the feature matrix will be massive in size (e.g., 30,000 features).
• Reduce to a smaller number, typically 500 to 1000.
• Keep the words with the highest TF or IDF values in the Corpus.
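Pruning then amounts to ranking words by weight and keeping the top k. A minimal sketch, assuming the weights are already available as a dictionary (the example reuses the illustrative IDF values from the earlier slide, and `prune` is a hypothetical name):

```python
def prune(weights, k):
    """Keep the k words with the highest weight (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

weights = {"quick": 2, "brown": 1.5, "fox": 2.7, "jump": 1.2,
           "lazy": 3, "dog": 1.15, "bark": 3, "cat": 1.15}
kept = prune(weights, 3)
print(kept)  # ['bark', 'lazy', 'fox']
```

On a real corpus k would be in the 500 to 1000 range mentioned above, and only the kept words become feature columns.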
Advanced Topic – Word Reduction
• Words that are part of a common grouping are replaced with a root word for the group.
• Steps:
1. Stemming/Lemmatization
2. Look up the Root Word in the Word Group Dictionary
3. If an entry exists, replace the word with the common root word for the group.
Group Example: male: [ man, gentleman, boy, guy, dude ]
Advanced Topic – Word Reduction
male : [ man, gentleman, boy, guy, dude ]
female: [ woman, lady, girl, gal ]
parent : [ father, mother, mom, mommy, dad, daddy ]
Word        Root
man         male
gentleman   male
boy         male
guy         male
dude        male
woman       female
lady        female
girl        female
gal         female
Example: The mother played with the girls while the dad prepared snacks for the ladies in mom’s reading group.
=> parent, play, female, parent, prepare, snack, female, parent, read, group
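The lookup step can be sketched with an inverted dictionary that maps each group member to its root. The groups are the ones listed on the slide; the names `WORD_GROUPS`, `ROOT_OF` and `reduce_words` are assumptions, and the input is assumed to be already stop-word-filtered and lemmatized (step 1).

```python
WORD_GROUPS = {
    "male": ["man", "gentleman", "boy", "guy", "dude"],
    "female": ["woman", "lady", "girl", "gal"],
    "parent": ["father", "mother", "mom", "mommy", "dad", "daddy"],
}

# Invert the groups once: member word -> root word for its group
ROOT_OF = {word: root for root, words in WORD_GROUPS.items() for word in words}

def reduce_words(lemmas):
    """Replace each lemma with its group root when it belongs to a group."""
    return [ROOT_OF.get(word, word) for word in lemmas]

lemmas = ["mother", "play", "girl", "dad", "prepare", "snack",
          "lady", "mom", "read", "group"]
print(reduce_words(lemmas))
```

Words without a group entry ("play", "snack", "group") pass through unchanged.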
Advanced Topic – N-grams
• Instead of parsing the sentence into single words, each as a feature, we group them in pairs (2-grams), triplets (3-grams), etc.
• Parameters:
1. Choose Window Size (2, 3, …)
2. Choose Stride Length (1, 2, …)
Figure: word1 word2 word3 … word4, with a 2-gram window over (word1, word2) and, after a stride of 1, a second 2-gram window over (word2, word3).
Advanced Topic – N-grams
The quick brown fox jumped over the lazy dog
=> quick, brown, fox, jump, lazy, dog
2-grams, stride of 1:
quick, brown
brown, fox
fox, jump
jump, lazy
lazy, dog
dog, <null>
quick, brown | brown, fox | fox, jump | jump, lazy | lazy, dog | dog
1            | 1          | 1         | 1          | 1         | 1
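The window and stride parameters can be sketched as a sliding slice over the word list. The function name `ngrams` and the `"<null>"` padding token (taken from the slide's last pair) are assumptions for illustration.

```python
def ngrams(words, window=2, stride=1, pad="<null>"):
    """Slide a window of `window` words over the list, advancing `stride` words each step."""
    grams = []
    for start in range(0, len(words), stride):
        gram = words[start:start + window]
        gram += [pad] * (window - len(gram))  # pad the tail so every gram has full width
        grams.append(tuple(gram))
    return grams

words = ["quick", "brown", "fox", "jump", "lazy", "dog"]
for gram in ngrams(words, window=2, stride=1):
    print(gram)
```

With a stride of 2 the windows no longer overlap, giving (quick, brown), (fox, jump), (lazy, dog). Each distinct n-gram then becomes a feature column in the same way single words did.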
More – Not Covered
• Word-Vectors [Word Embedding]
• Correcting Misspellings
• Detecting incorrectly categorized Narratives.
Final – Homegrown Tool
• I built a command-line tool for doing all the steps in this presentation.
• Java based, packaged as a JAR file.
https://github.com/andrewferlitsch/Portland-Data-Science-Group/blob/master/README.NLP.md
Final – Homegrown Tool – Examples
• Quora question pairs (training set: 400,000)
java -jar nlp.jar -c3,4 train.csv
• Remove Stop Words
java -jar nlp.jar -c3,4 -e p train.csv
• Lemmatize and Reduce to Common Root
java -jar nlp.jar -c3,4 -e p -l -r train.csv
• Lemmatize and Reduce to Common Root
java -jar nlp.jar -c3,4 -e p -l -r -F train.csv
