Working with unstructured data, as in natural language processing (NLP), is a fascinating and nuanced area of machine learning and data science. While NLP benefits from an abundance of tools and cutting-edge research, getting started with applying these methods can be overwhelming. This presentation discusses concepts useful for approaching an NLP project and walks through an example application, noting a few best practices the presenter has found along the way.
Natural Language Processing - Principles and Practice - Gracie Diaz
1. Demo: Topic Modeling in Natural Language Processing
WiDS Miami | March 4, 2019
Gracie Diaz | Royal Caribbean Cruises Ltd.
2. Agenda
• Data exploration: Amazon.com food product reviews
• Topic modeling via Latent Dirichlet Allocation (LDA)
• Preprocessing and its role in LDA topic quality
• Closing thoughts: Suggested LDA best-practices
4. Data: 500k+ food product reviews from Amazon.com
Source: Amazon Fine Food Reviews, "568,454 food reviews Amazon users left up to October 2012", https://www.kaggle.com/snap/amazon-fine-food-reviews
10. Topic "factor" in document-term relationships
[Figure: a (sparse) documents x terms matrix of counts. The sparse matrix is the result of vectorization of the documents' preprocessed terms (more on this next!). Example terms: 'flavor', 'great', 'are', 'quality' are common terms, likely evenly distributed across documents about food; 'saltwater' [taffy], 'canned' [dog food], and 'salted' [peanuts] are each more likely in documents specifically about taffy, dog food, and peanuts, respectively.]
11. Topic "factor" in document-term relationships
[Figure: the documents x terms matrix factored through topics, as documents x topics and topics x terms, with topics 0 (taffy), 1 (dog food), and 2 (other). 'flavor', 'great', 'are', 'quality' are common terms, likely evenly distributed across topics or documents about food; 'saltwater' [taffy], 'canned' [dog food], and 'salted' [peanuts] are each more likely in topics about taffy, dog food, and peanuts, respectively.]
12. Agenda
• Data exploration: Amazon.com food product reviews
• Topic modeling via Latent Dirichlet Allocation (LDA)
• Preprocessing and its role in LDA topic quality
• Closing thoughts: Suggested LDA best-practices
13. What influences LDA models’ topic “quality”?
• Number of topics selected – all high quality topics, or only some?
• Random seed – results are vulnerable to local optima
• Term preprocessing – steps might include:
• standardization
• lemmatization
• parts-of-speech (POS) filtering
• n-grams
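The preprocessing steps above can be sketched as a small pipeline. This is a minimal, hypothetical sketch in plain Python: the stopword list and lemma table are toys standing in for what a real pipeline would get from a library such as spaCy or NLTK (stopword filtering stands in for POS filtering here).

```python
import re

# Toy tables (hypothetical; real pipelines use NLP-library resources)
STOPWORDS = {"the", "a", "is", "are", "and"}     # stands in for POS filtering
LEMMAS = {"flavors": "flavor", "canned": "can"}  # toy lemma lookup table

def preprocess(text):
    # 1. standardization: lowercase and keep only alphabetic tokens
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. filtering: drop common function words
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 3. lemmatization: map inflected forms to a base form
    tokens = [LEMMAS.get(t, t) for t in tokens]
    # 4. n-grams: append bigrams as joined tokens
    bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(preprocess("The flavors are great!"))  # ['flavor', 'great', 'flavor_great']
```

The order of these steps matters: lemmatizing before building bigrams, for example, produces different n-gram tokens than the reverse.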
15. Demo agenda
• Data exploration: Amazon.com food product reviews
• Topic modeling via Latent Dirichlet Allocation (LDA)
• Preprocessing and its role in LDA topic quality
• Closing thoughts: Suggested LDA best-practices
16. Suggested LDA best-practices
• Go “Big” or Go Home!
• Number of documents (minimize pre-filtering)
• Number of topics (don’t be afraid of going big, like 50+, 100+)
• Model passes/iterations (check docs)
• Modify recipe to taste
• Change preprocessing steps or order
• Adjust parameters
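The knobs above (topic count, iterations, random seed) can be seen in code. The talk's own demo tooling is not shown in this excerpt, so this is a sketch using scikit-learn's LatentDirichletAllocation with a made-up toy corpus; the parameter names are sklearn's.

```python
# Hedged sketch: fitting LDA with scikit-learn on a toy corpus.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "saltwater taffy great flavor taffy",
    "canned dog food quality dog",
    "salted peanuts great quality",
]
X = CountVectorizer().fit_transform(docs)  # (sparse) matrix of counts

lda = LatentDirichletAllocation(
    n_components=3,   # number of topics: don't be afraid of going big
    max_iter=20,      # passes/iterations: check the docs for defaults
    random_state=0,   # fix the seed: results are vulnerable to local optima
)
doc_topics = lda.fit_transform(X)  # document-topic weights, rows sum to 1
print(doc_topics.shape)            # (3, 3): 3 documents x 3 topics
```

Rerunning with a different `random_state` can yield noticeably different topics, which is why fixing the seed (and comparing several seeds) is worth the effort.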
19. Supervised vs. Unsupervised Algorithms
Image source: https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d
[Diagram: machine learning splits into supervised (labeled data; predict an outcome) and unsupervised (no labels; find hidden structure). Latent Dirichlet Allocation (LDA) falls under unsupervised learning.]
20. Topic modeling: An unsupervised problem
[Diagram: a corpus of documents, each a collection of terms, is fitted into topics, i.e. term-topic relationships. Generically: the docs are observations, records, or rows; the corpus is the dataset; the topics are the fitted model.]
21. The Topic "Factor"
[Figure: the documents x terms matrix is approximately equal to a documents x topics matrix times a topics x terms matrix, with topics 0 (taffy), 1 (dog food), and 2 (other). Common terms are likely evenly distributed across topics or documents; 'saltwater taffy' is more likely in topic 0, 'canned dog food' in topic 1, and 'salted peanuts' in topic 2.]
22. Structure of LDA document-term-topic relationships
[Figure: the (sparse) documents x terms matrix of counts is approximately equal to a documents x topics matrix times a topics x terms matrix.]
X ≈ WH (Non-negative Matrix Factorization, NMF)
• A set of documents is a corpus; a set of terms is a dictionary.
• The sparse matrix is the result of "vectorization" of the documents' term counts.
• The model is the term-document matrix factored into Document-Topic weights and Topic-Term weights.
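The X ≈ WH factorization can be run directly on a toy count matrix. This sketch uses scikit-learn's NMF (as the slide notes, NMF, not LDA, is the factorization written as X ≈ WH); the three-document corpus is made up to echo the taffy/dog food examples.

```python
# Sketch of X ≈ WH: vectorize documents, then factor the count matrix.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = ["saltwater taffy taffy", "canned dog food", "salted peanuts"]
X = CountVectorizer().fit_transform(docs)  # documents x terms, sparse counts

nmf = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X)   # W: document-topic weights (documents x topics)
H = nmf.components_        # H: topic-term weights (topics x terms)

# W @ H approximately reconstructs the original count matrix X
print(np.round(W @ H, 1))
```

LDA reaches a similar document-topic / topic-term decomposition, but as probability distributions fitted by a generative model rather than by least-squares factorization.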
25. Players in topic modeling
• term – a word or word token (could be n-gram, lemma, or stem)
• Examples: “plane”, “very”, “be_available”, “fault_line”
• In machine learning this would be: one of the features on which a model trains
• document – a collection of words or sentences that has a real-world purpose/context for its existence
• Examples: an email, a book or one of its chapters, a social media post, a journalistic article, etc.
• In machine learning this would be: a single piece of data, like a “row”, “record”, “observation”
• corpus – the set of documents parsed for analysis (plural: corpora)
• Examples: correspondence, book collection, articles
• In machine learning this would be: the overall dataset of rows and observations
• dictionary – a mapping from each of the corpus's preprocessed "words" to a numeric id
• Example: for “I am orange, but I am not purple” words are: 1: I, 2: am, 3: orange, 4: but, 5: not, 6: purple
• In machine learning this would be: the set of features
• vectorization – the stats of the "words" in the documents (e.g. term frequency)
• "Document 001 has 2 instances of the word identified by the number 2"
• In machine learning this would be: how each row stacks up in terms of each feature
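The dictionary and vectorization entries above can be sketched in plain Python using the slide's own example sentence. (Libraries such as gensim provide this for real corpora via `corpora.Dictionary` and `doc2bow`; the two helpers below are hypothetical toys.)

```python
from collections import Counter

def build_dictionary(docs):
    # Assign a numeric id to each unique word, in order of first appearance
    ids = {}
    for doc in docs:
        for word in doc.split():
            ids.setdefault(word, len(ids) + 1)
    return ids

def vectorize(doc, ids):
    # (word-id, count) pairs, e.g. "Document 001 has 2 instances of word 2"
    counts = Counter(doc.split())
    return [(ids[word], count) for word, count in counts.items()]

docs = ["I am orange but I am not purple"]  # punctuation already stripped
ids = build_dictionary(docs)
print(ids)                      # {'I': 1, 'am': 2, 'orange': 3, 'but': 4, 'not': 5, 'purple': 6}
print(vectorize(docs[0], ids))  # [(1, 2), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1)]
```

The vectorized output is exactly the sparse-counts representation the earlier slides feed into the topic model: word ids 1 ('I') and 2 ('am') each appear twice, and the rest once.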