Working with unstructured data, as in natural language processing (NLP), is a fascinating and nuanced area of machine learning and data science. While NLP benefits from an abundance of tools and cutting-edge research, getting started with applying these methods can be overwhelming. This presentation discusses concepts useful for approaching an NLP project and walks through an example application, noting a few best practices the presenter has found along the way.
Natural Language Processing - Principles and Practice - Gracie Diaz
1. Demo: Topic Modeling in Natural Language Processing
WiDS Miami | March 4, 2019
Gracie Diaz | Royal Caribbean Cruises Ltd.
2. Agenda
• Data exploration: Amazon.com food product reviews
• Topic modeling via Latent Dirichlet Allocation (LDA)
• Preprocessing and its role in LDA topic quality
• Closing thoughts: Suggested LDA best-practices
4. Data: 500k+ food product reviews from Amazon.com
Source: Amazon Fine Food Reviews, "568,454 food reviews Amazon users left up to October 2012", https://www.kaggle.com/snap/amazon-fine-food-reviews
10. Topic "factor" in document-term relationships
[Figure: a (sparse) documents x terms matrix of counts. The sparse matrix is the result of vectorization of the documents' preprocessed terms (more on this next!). Example terms: 'flavor', 'great', 'are', 'quality' are common terms, likely evenly distributed across documents about food; 'saltwater' [taffy], 'canned' [dog food], and 'salted' [peanuts] are each more likely in documents specifically about taffy, dog food, and peanuts, respectively.]
11. Topic "factor" in document-term relationships
[Figure: the documents x terms matrix factored through topics, as documents x topics and topics x terms, with topics 0 (taffy), 1 (dog food), and 2 (other). 'flavor', 'great', 'are', 'quality' are common terms, likely evenly distributed across topics or documents about food; 'saltwater' [taffy], 'canned' [dog food], and 'salted' [peanuts] are each more likely in topics about taffy, dog food, and peanuts, respectively.]
12. Agenda
• Data exploration: Amazon.com food product reviews
• Topic modeling via Latent Dirichlet Allocation (LDA)
• Preprocessing and its role in LDA topic quality
• Closing thoughts: Suggested LDA best-practices
13. What influences LDA models’ topic “quality”?
• Number of topics selected – all high quality topics, or only some?
• Random seed – results are vulnerable to local optima
• Term preprocessing – steps might include:
• standardization
• lemmatization
• parts-of-speech (POS) filtering
• n-grams
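The preprocessing steps above can be sketched as a small pipeline. This is a minimal, hypothetical sketch in plain Python: the stopword list and lemma table are toys standing in for what a real pipeline would get from a library such as spaCy or NLTK (stopword filtering stands in for POS filtering here).

```python
import re

# Toy tables (hypothetical; real pipelines use NLP-library resources)
STOPWORDS = {"the", "a", "is", "are", "and"}     # stands in for POS filtering
LEMMAS = {"flavors": "flavor", "canned": "can"}  # toy lemma lookup table

def preprocess(text):
    # 1. standardization: lowercase and keep only alphabetic tokens
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. filtering: drop common function words
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 3. lemmatization: map inflected forms to a base form
    tokens = [LEMMAS.get(t, t) for t in tokens]
    # 4. n-grams: append bigrams as joined tokens
    bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(preprocess("The flavors are great!"))  # ['flavor', 'great', 'flavor_great']
```

The order of these steps matters: lemmatizing before building bigrams, for example, produces different n-gram tokens than the reverse.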
15. Demo agenda
• Data exploration: Amazon.com food product reviews
• Topic modeling via Latent Dirichlet Allocation (LDA)
• Preprocessing and its role in LDA topic quality
• Closing thoughts: Suggested LDA best-practices
16. Suggested LDA best-practices
• Go “Big” or Go Home!
• Number of documents (minimize pre-filtering)
• Number of topics (don’t be afraid of going big, like 50+, 100+)
• Model passes/iterations (check docs)
• Modify recipe to taste
• Change preprocessing steps or order
• Adjust parameters
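The knobs above (topic count, iterations, random seed) can be seen in code. The talk's own demo tooling is not shown in this excerpt, so this is a sketch using scikit-learn's LatentDirichletAllocation with a made-up toy corpus; the parameter names are sklearn's.

```python
# Hedged sketch: fitting LDA with scikit-learn on a toy corpus.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "saltwater taffy great flavor taffy",
    "canned dog food quality dog",
    "salted peanuts great quality",
]
X = CountVectorizer().fit_transform(docs)  # (sparse) matrix of counts

lda = LatentDirichletAllocation(
    n_components=3,   # number of topics: don't be afraid of going big
    max_iter=20,      # passes/iterations: check the docs for defaults
    random_state=0,   # fix the seed: results are vulnerable to local optima
)
doc_topics = lda.fit_transform(X)  # document-topic weights, rows sum to 1
print(doc_topics.shape)            # (3, 3): 3 documents x 3 topics
```

Rerunning with a different `random_state` can yield noticeably different topics, which is why fixing the seed (and comparing several seeds) is worth the effort.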
19. Supervised vs. Unsupervised Algorithms
Image source: https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d
[Diagram: machine learning splits into supervised (labeled data; predict an outcome) and unsupervised (no labels; find hidden structure). Latent Dirichlet Allocation (LDA) falls under unsupervised learning.]
20. Topic modeling: An unsupervised problem
[Diagram: a corpus of documents, each a collection of terms, is fitted into topics, i.e. term-topic relationships. Generically: the docs are observations, records, or rows; the corpus is the dataset; the topics are the fitted model.]
21. The Topic "Factor"
[Figure: the documents x terms matrix is approximately equal to a documents x topics matrix times a topics x terms matrix, with topics 0 (taffy), 1 (dog food), and 2 (other). Common terms are likely evenly distributed across topics or documents; 'saltwater taffy' is more likely in topic 0, 'canned dog food' in topic 1, and 'salted peanuts' in topic 2.]
22. Structure of LDA document-term-topic relationships
[Figure: the (sparse) documents x terms matrix of counts is approximately equal to a documents x topics matrix times a topics x terms matrix.]
X ≈ WH (Non-negative Matrix Factorization, NMF)
• A set of documents is a corpus; a set of terms is a dictionary.
• The sparse matrix is the result of "vectorization" of the documents' term counts.
• The model is the term-document matrix factored into Document-Topic weights and Topic-Term weights.
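The X ≈ WH factorization can be run directly on a toy count matrix. This sketch uses scikit-learn's NMF (as the slide notes, NMF, not LDA, is the factorization written as X ≈ WH); the three-document corpus is made up to echo the taffy/dog food examples.

```python
# Sketch of X ≈ WH: vectorize documents, then factor the count matrix.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = ["saltwater taffy taffy", "canned dog food", "salted peanuts"]
X = CountVectorizer().fit_transform(docs)  # documents x terms, sparse counts

nmf = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X)   # W: document-topic weights (documents x topics)
H = nmf.components_        # H: topic-term weights (topics x terms)

# W @ H approximately reconstructs the original count matrix X
print(np.round(W @ H, 1))
```

LDA reaches a similar document-topic / topic-term decomposition, but as probability distributions fitted by a generative model rather than by least-squares factorization.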
25. Players in topic modeling
• term – a word or word token (could be n-gram, lemma, or stem)
• Examples: “plane”, “very”, “be_available”, “fault_line”
• In machine learning this would be: one of the features on which a model trains
• document – a collection of words or sentences that has a real-world purpose/context for its existence
• Examples: an email, a book or one of its chapters, a social media post, a journalistic article, etc.
• In machine learning this would be: a single piece of data, like a “row”, “record”, “observation”
• corpus – the set of documents parsed for analysis (plural: corpora)
• Examples: correspondence, book collection, articles
• In machine learning this would be: the overall dataset of rows and observations
• dictionary – a mapping from each of the corpus's preprocessed "words" to a numeric id
• Example: for “I am orange, but I am not purple” words are: 1: I, 2: am, 3: orange, 4: but, 5: not, 6: purple
• In machine learning this would be: the set of features
• vectorization – the stats of the "words" in the documents (e.g. term frequency)
• "Document 001 has 2 instances of the word identified by the number 2"
• In machine learning this would be: how each row stacks up in terms of each feature
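The dictionary and vectorization entries above can be sketched in plain Python using the slide's own example sentence. (Libraries such as gensim provide this for real corpora via `corpora.Dictionary` and `doc2bow`; the two helpers below are hypothetical toys.)

```python
from collections import Counter

def build_dictionary(docs):
    # Assign a numeric id to each unique word, in order of first appearance
    ids = {}
    for doc in docs:
        for word in doc.split():
            ids.setdefault(word, len(ids) + 1)
    return ids

def vectorize(doc, ids):
    # (word-id, count) pairs, e.g. "Document 001 has 2 instances of word 2"
    counts = Counter(doc.split())
    return [(ids[word], count) for word, count in counts.items()]

docs = ["I am orange but I am not purple"]  # punctuation already stripped
ids = build_dictionary(docs)
print(ids)                      # {'I': 1, 'am': 2, 'orange': 3, 'but': 4, 'not': 5, 'purple': 6}
print(vectorize(docs[0], ids))  # [(1, 2), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1)]
```

The vectorized output is exactly the sparse-counts representation the earlier slides feed into the topic model: word ids 1 ('I') and 2 ('am') each appear twice, and the rest once.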