Demo: Topic Modeling in
Natural Language Processing
WiDS Miami | March 4, 2019
Gracie Diaz | Royal Caribbean Cruises Ltd.
Agenda
• Data exploration: Amazon.com food product reviews
• Topic modeling via Latent Dirichlet Allocation (LDA)
• Preprocessing and its role in LDA topic quality
• Closing thoughts: Suggested LDA best-practices
Data: 500k+ food product reviews from Amazon.com
Source: Amazon Fine Food Reviews, “568,454 food reviews Amazon users left up to October 2012”, https://www.kaggle.com/snap/amazon-fine-food-reviews
Topics we can see:
• “taffy”
• “dog food”
• …… (other)
Agenda
• Data exploration: Amazon.com food product reviews
• Topic modeling via Latent Dirichlet Allocation (LDA)
• Preprocessing and its role in LDA topic quality
• Closing thoughts: Suggested LDA best-practices
Topic modeling
via Latent Dirichlet Allocation (LDA)
[Diagram: a (sparse) documents × terms matrix of counts is factored into a documents × topics matrix and a topics × terms matrix]
𝑋 ≈ 𝑊𝐻
(Non-negative Matrix Factorization – NMF)
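The 𝑋 ≈ 𝑊𝐻 picture above can be sketched numerically with a toy multiplicative-update NMF in numpy. This is an illustrative stand-in only: LDA replaces the plain factorization with a Bayesian generative model, but the document × topic and topic × term shapes are the same.

```python
import numpy as np

# Toy documents x terms count matrix (6 docs, 5 terms), standing in for X
X = np.array([
    [2, 1, 0, 0, 1],
    [3, 0, 0, 1, 2],
    [0, 0, 4, 1, 0],
    [0, 1, 3, 2, 0],
    [1, 0, 0, 3, 1],
    [0, 0, 1, 4, 0],
], dtype=float)

rng = np.random.default_rng(0)
n_topics = 2
W = rng.random((X.shape[0], n_topics))  # documents x topics weights
H = rng.random((n_topics, X.shape[1]))  # topics x terms weights

# Classic Lee & Seung multiplicative updates minimizing ||X - WH||_F;
# the small epsilon avoids division by zero
for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
    W *= (X @ H.T) / (W @ H @ H.T + 1e-9)

error = np.linalg.norm(X - W @ H)
```

Non-negativity is what makes the factors readable as topic weights, which is why NMF is a useful mental model for LDA's output.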
Topic “factor”
in document-term relationships
[Diagram: the (sparse) documents × terms matrix of counts, with example term columns]
• ‘flavor’, ‘great’, ‘are’, ‘quality’: common terms, likely evenly distributed across documents about food
• ‘saltwater’ [taffy]: term more likely in documents specifically about taffy
• ‘canned’ [dog food]: term more likely in documents specifically about dog food
• ‘salted’ [peanuts]: term more likely in documents specifically about peanuts
The sparse matrix is the result of vectorization of the documents’ preprocessed terms (more on this next!)
Topic “factor”
in document-term relationships
[Diagram: documents × terms factored through topics 0 (taffy), 1 (dog food), 2 (other)]
• ‘flavor’, ‘great’, ‘are’, ‘quality’: common terms, likely evenly distributed across topics or documents about food
• ‘saltwater’ [taffy]: term more likely in topics about taffy
• ‘canned’ [dog food]: term more likely in topics about dog food
• ‘salted’ [peanuts]: term more likely in topics about peanuts
Agenda
• Data exploration: Amazon.com food product reviews
• Topic modeling via Latent Dirichlet Allocation (LDA)
• Preprocessing terms and its role in LDA topic quality
• Closing thoughts: Suggested LDA best-practices
What influences an LDA model’s topic “quality”?
• Number of topics selected – are all of the resulting topics high quality, or only some?
• Random seed – results are vulnerable to local optima
• Term preprocessing – steps might include:
• standardization
• lemmatization
• parts-of-speech (POS) filtering
• n-grams
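A minimal plain-Python sketch of some of these steps. The lemma table and stopword list below are hypothetical stand-ins; a real pipeline would typically use spaCy for lemmatization and POS filtering, and gensim's Phrases for n-grams.

```python
import re

# Hypothetical tiny lemma table standing in for a real lemmatizer (e.g. spaCy)
LEMMAS = {"peanuts": "peanut", "salted": "salt", "arrived": "arrive", "labeled": "label"}
STOPWORDS = {"the", "as", "if", "was", "a", "an", "of"}

def standardize(text):
    # Standardization: lowercase and keep alphabetic tokens only
    return re.findall(r"[a-z]+", text.lower())

def lemmatize(tokens):
    # Map each token to its lemma where the table knows one
    return [LEMMAS.get(t, t) for t in tokens]

def bigrams(tokens):
    # Join adjacent tokens into word_word n-grams, like 'be_very'
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

raw = "The peanuts arrived labeled as Salted, if the vendor intended..."
tokens = [t for t in standardize(raw) if t not in STOPWORDS]
lemmas = lemmatize(tokens)
```

Each step shrinks and cleans the dictionary the LDA model will see, which is exactly the effect the notebook demo compares.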
How preprocessing impacts topic results (Jupyter notebook)
• tokenized only
• standardized
• lemmatized/POS filtered
• n-grams
Demo agenda
• Data exploration: Amazon.com food product reviews
• Topic modeling via Latent Dirichlet Allocation (LDA)
• Preprocessing and its role in LDA topic quality
• Closing thoughts: Suggested LDA best-practices
Suggested LDA best-practices
• Go “Big” or Go Home!
• Number of documents (minimize pre-filtering)
• Number of topics (don’t be afraid of going big, like 50+, 100+)
• Model passes/iterations (check docs)
• Modify recipe to taste
• Change preprocessing steps or order
• Adjust parameters
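These knobs map onto ordinary estimator parameters. A sketch using scikit-learn's LatentDirichletAllocation on a toy corpus (the documents and parameter values here are illustrative only, not recommendations for the Amazon data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "saltwater taffy great flavor soft and chewy",
    "my dog loves this canned dog food",
    "salted peanuts arrived labeled jumbo",
    "the taffy was delivered very quickly",
    "good quality canned dog food product",
]

# Vectorize the terms into the (sparse) documents x terms count matrix
vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=3,   # number of topics; real corpora often warrant 50+, 100+
    max_iter=20,      # passes/iterations over the corpus; check the docs
    random_state=42,  # fix the seed, since results vary with local optima
)
doc_topics = lda.fit_transform(X)  # documents x topics weights
```

Changing `n_components`, `max_iter`, or the preprocessing that feeds `CountVectorizer` is the "modify recipe to taste" loop in practice.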
Thank you!
Appendices
Supervised vs. Unsupervised Algorithms
Image source: https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d
Machine Learning:
• Supervised: labeled data; predict an outcome
• Unsupervised: no labels; find hidden structure (this is where Latent Dirichlet Allocation (LDA) fits)
Topic modeling: An unsupervised problem
[Diagram: a corpus of docs, each made of terms, maps to topics and term-topic relationships]
Generically: the docs are observations, records, rows; the corpus is the dataset; the topics are the fitted model
The Topic “Factor”
[Diagram: documents × terms ≈ (documents × topics) × (topics × terms), with topics 0 (taffy), 1 (dog food), 2 (other)]
• Common terms: likely evenly distributed across topics OR documents
• saltwater taffy: more likely in topic 0
• canned dog food: more likely in topic 1
• salted peanuts: more likely in topic 2
Structure of LDA
document-term-topic relationships
[Diagram: a (sparse) documents × terms matrix of counts factored into documents × topics and topics × terms matrices]
𝑋 ≈ 𝑊𝐻
(Non-negative Matrix Factorization – NMF)
• set of documents → corpus
• set of terms → dictionary
• the sparse matrix is the result of “vectorization” of the documents’ term counts
• the model is the term-document matrix factored into Document-Topic weights and Topic-Term weights
Players in topic modeling
• term – a word or word token (could be n-gram, lemma, or stem)
• Examples: “plane”, “very”, “be_available”, “fault_line”
• In machine learning this would be: one of the features on which a model trains
• document – a collection of words or sentences that has a real-world purpose/context for its existence
• Examples: an email, a book or one of its chapters, a social media post, a journalistic article, etc.
• In machine learning this would be: a single piece of data, like a “row”, “record”, “observation”
• corpus – the set of documents parsed for analysis (plural: corpora)
• Examples: correspondence, book collection, articles
• In machine learning this would be: the overall dataset of rows and observations
• dictionary – a mapping of each of the corpus’s preprocessed “words” to a numeric id
• Example: for “I am orange, but I am not purple” words are: 1: I, 2: am, 3: orange, 4: but, 5: not, 6: purple
• In machine learning this would be: the set of features
• vectorization – the statistics of the “words” in the documents (e.g. term frequency)
• “Document 001 has 2 instances of the word identified by the number 2”
• In machine learning this would be: how each row stacks up in terms of each feature
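The dictionary and vectorization entries can be made concrete with the example sentence above. This is a minimal hand-rolled sketch using 0-based ids (the slide numbers from 1); in practice gensim's Dictionary and doc2bow handle both steps.

```python
from collections import Counter

doc = "I am orange , but I am not purple".split()

# dictionary: each unique preprocessed word gets a numeric id (0-based here)
dictionary = {}
for word in doc:
    if word not in dictionary and word != ",":
        dictionary[word] = len(dictionary)

# vectorization: term frequency per word id for this document
bow = Counter(dictionary[w] for w in doc if w != ",")
```

The bag-of-words counts (e.g. two instances of the id for “am”) are exactly the rows of the sparse documents × terms matrix that LDA factors.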
How preprocessing impacts topic results
• all tokens
• standardization
• lemmas/POS
• n-grams
How preprocessing impacts topic results: all tokens (dictionary excerpt)
{'I': 0, 'Labrador': 1, 'My': 2, 'The': 3, 'Vitality': 4, 'a': 5, 'all':
6, 'and': 7, 'appreciates': 8, 'be': 9, 'better': 10, 'better.': 11,
'bought': 12, 'canned': 13, 'dog': 14, 'finicky': 15, 'food': 16,
'found': 17, 'good': 18, 'have': 19, 'is': 20, 'it': 21, 'like': 22,
'looks': 23, 'meat': 24, 'more': 25, 'most.': 26, 'of': 27, 'processed':
28, 'product': 29, 'products': 30, 'quality.': 31, 'several': 32, 'she':
33, 'smells': 34, 'stew': 35, 'than': 36, 'the': 37, 'them': 38, 'this':
39, 'to': 40, '"Jumbo".': 41, 'Jumbo': 42, 'Not': 43, 'Peanuts...the':
44, 'Product': 45, 'Salted': 46, 'actually': 47, 'an': 48, 'arrived':
49, 'as': 50, 'error': 51, 'if': 52, 'intended': 53, 'labeled': 54,
'or': 55, 'peanuts': 56, 'represent': 57, 'sized': 58, 'small': 59,
'sure': 60, 'unsalted.': 61, 'vendor': 62, 'was': 63, 'were': 64,
'"The': 65, '-': 66, 'And': 67, 'Brother': 68, 'C.S.': 69, 'Edmund': 70,
'Filberts.': 71, 'If': 72, 'It': 73, "Lewis'": 74, 'Lion,': 75,
'Sisters': 76, 'This': 77, 'Wardrobe"': 78, 'Witch,': 79, 'Witch.': 80,
'are': 81, 'around': 82, 'been': 83, 'case': 84, 'centuries.': 85,
'chewy,': 86, 'citrus': 87, 'coated': 88, 'confection': 89, 'cut': 90,
'familiar': 91, 'few': 92, 'flavorful.': 93, 'gelatin': 94, 'has': 95,
'heaven.': 96, 'highly': 97, 'his': 98, 'in': 99, 'into': 100, . . .
How preprocessing impacts topic results: standardization (dictionary excerpt)
{'all': 0, 'and': 1, 'appreciates': 2, 'be': 3, 'better': 4,
'bought': 5, 'canned': 6, 'dog': 7, 'finicky': 8, 'food': 9,
'found': 10, 'good': 11, 'have': 12, 'is': 13, 'it': 14,
'labrador': 15, 'like': 16, 'looks': 17, 'meat': 18, 'more': 19,
'most': 20, 'my': 21, 'of': 22, 'processed': 23, 'product': 24,
'products': 25, 'quality': 26, 'several': 27, 'she': 28,
'smells': 29, 'stew': 30, 'than': 31, 'the': 32, 'them': 33,
'this': 34, 'to': 35, 'vitality': 36, 'actually': 37, 'an': 38,
'arrived': 39, 'as': 40, 'error': 41, 'if': 42, 'intended': 43,
'jumbo': 44, 'labeled': 45, 'not': 46, 'or': 47, 'peanuts': 48,
'represent': 49, 'salted': 50, 'sized': 51, 'small': 52, 'sure':
53, 'unsalted': 54, 'vendor': 55, 'was': 56, 'were': 57, 'are':
58, 'around': 59, 'been': 60, 'brother': 61, 'case': 62,
'centuries': 63, 'chewy': 64, 'citrus': 65, 'coated': 66,
'confection': 67, 'cut': 68, 'edmund': 69, 'familiar': 70,
'few': 71, 'filberts': 72, 'flavorful': 73, 'gelatin': 74,
'has': 75, 'heaven': 76, 'highly': 77, 'his': 78, 'in': 79,
'into': 80, 'lewis': 81, 'liberally': 82, 'light': 83, 'lion':
84, 'mouthful': 85, 'nuts': 86, 'out': 87, 'pillowy': 88,
'powdered': 89, 'recommend': 90, 'seduces': 91, 'selling': 92,
'sisters': 93, 'squares': 94, 'story': 95, 'sugar': 96, 'that':
97, 'then': 98, 'tiny': 99, 'too': 100
How preprocessing impacts topic results: lemmas/POS filtering (dictionary excerpt)
{'-PRON-': 0, 'appreciate': 1, 'be': 2, 'better': 3, 'buy': 4, 'can': 5,
'dog': 6, 'find': 7, 'finicky': 8, 'food': 9, 'good': 10, 'have': 11,
'labrador': 12, 'look': 13, 'meat': 14, 'more': 15, 'most': 16,
'process': 17, 'product': 18, 'quality': 19, 'several': 20, 'smell': 21,
'stew': 22, 'vitality': 23, 'actually': 24, 'arrive': 25, 'error': 26,
'intend': 27, 'jumbo': 28, 'label': 29, 'not': 30, 'peanut': 31,
'represent': 32, 'salt': 33, 'sized': 34, 'small': 35, 'sure': 36,
'unsalted': 37, 'vendor': 38, 'brother': 39, 'case': 40, 'century': 41,
'chewy': 42, 'citrus': 43, 'coat': 44, 'confection': 45, 'cut': 46,
'edmund': 47, 'familiar': 48, 'few': 49, 'filbert': 50, 'flavorful': 51,
'gelatin': 52, 'highly': 53, 'lewi': 54, 'liberally': 55, 'light': 56,
'lion': 57, 'mouthful': 58, 'nut': 59, 'pillowy': 60, 'powdered': 61,
'recommend': 62, 'seduce': 63, 'sell': 64, 'sister': 65, 'square': 66,
'story': 67, 'sugar': 68, 'that': 69, 'then': 70, 'tiny': 71, 'too': 72,
'treat': 73, 'very': 74, 'wardrobe': 75, 'witch': 76, 'yummy': 77,
'addition': 78, 'beer': 79, 'believe': 80, 'cherry': 81, 'extract': 82,
'flavor': 83, 'get': 84, 'ingredient': 85, 'make': 86, 'medicinal': 87,
'order': 88, 'robitussin': 89, 'root': 90, 'secret': 91, 'soda': 92,
'which': 93, 'assortment': 94, 'deal': 95, 'delivery': 96, 'great': 97,
'lover': 98, 'price': 99, 'quick': 100
How preprocessing impacts topic results: n-grams (dictionary excerpt)
{'-PRON-': 0, 'appreciate': 1, 'be': 2, 'better': 3, 'buy': 4, 'can':
5, 'dog': 6, 'find': 7, 'finicky': 8, 'food': 9, 'good': 10, 'have':
11, 'labrador': 12, 'look': 13, 'meat': 14, 'more': 15, 'most': 16,
'process': 17, 'product': 18, 'quality': 19, 'several': 20, 'smell':
21, 'stew': 22, 'vitality': 23, 'actually': 24, 'arrive': 25,
'error': 26, 'intend': 27, 'jumbo': 28, 'label': 29, 'not': 30,
'peanut': 31, 'represent': 32, 'salt': 33, 'sized': 34, 'small': 35,
'sure': 36, 'unsalted': 37, 'vendor': 38, 'brother': 39, 'case': 40,
'century': 41, 'chewy': 42, 'citrus': 43, 'coat': 44, 'confection':
45, 'cut': 46, 'edmund': 47, 'familiar': 48, 'few': 49, 'filbert':
50, 'flavorful': 51, 'gelatin': 52, 'highly': 53, 'lewi': 54,
'liberally': 55, 'light': 56, 'lion': 57, 'mouthful': 58, 'nut': 59,
'pillowy': 60, 'powdered': 61, 'recommend': 62, 'seduce': 63, 'sell':
64, 'sister': 65, 'square': 66, 'story': 67, 'sugar': 68, 'that': 69,
'then': 70, 'tiny': 71, 'too': 72, 'treat': 73, 'very': 74,
'wardrobe': 75, 'witch': 76, 'yummy': 77, 'addition': 78, 'beer': 79,
'believe': 80, 'cherry': 81, 'extract': 82, 'flavor_be': 83, 'get':
84, 'ingredient': 85, 'make': 86, 'medicinal': 87, 'order': 88,
'robitussin': 89, 'root': 90, 'secret': 91, 'soda': 92, 'which': 93,
'assortment': 94, 'be_very': 95, 'deal': 96, 'delivery': 97, 'great':
98, 'lover': 99, 'price': 100
More Related Content

Similar to Natural Language Processing - Principles and Practice - Gracie Diaz

07-Classification.pptx
07-Classification.pptx07-Classification.pptx
07-Classification.pptx
Shree Shree
 
NYSAFLT summer Theisen tech workshop handout 2011 21st century tools to teac...
NYSAFLT summer Theisen tech  workshop handout 2011 21st century tools to teac...NYSAFLT summer Theisen tech  workshop handout 2011 21st century tools to teac...
NYSAFLT summer Theisen tech workshop handout 2011 21st century tools to teac...
Toni Theisen
 
NYSALT Technology keynote summer 2011
NYSALT Technology keynote summer 2011 NYSALT Technology keynote summer 2011
NYSALT Technology keynote summer 2011
Toni Theisen
 
Kampmeier ecn 2012
Kampmeier ecn 2012Kampmeier ecn 2012
Kampmeier ecn 2012
ECNOfficer
 
!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx
!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx
!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx
katherncarlyle
 

Similar to Natural Language Processing - Principles and Practice - Gracie Diaz (20)

MongoDB .local Bengaluru 2019: A Complete Methodology to Data Modeling for Mo...
MongoDB .local Bengaluru 2019: A Complete Methodology to Data Modeling for Mo...MongoDB .local Bengaluru 2019: A Complete Methodology to Data Modeling for Mo...
MongoDB .local Bengaluru 2019: A Complete Methodology to Data Modeling for Mo...
 
Develop winning federal_proposals
Develop winning federal_proposalsDevelop winning federal_proposals
Develop winning federal_proposals
 
Data Mining Dissertations and Adventures and Experiences in the World of Chem...
Data Mining Dissertations and Adventures and Experiences in the World of Chem...Data Mining Dissertations and Adventures and Experiences in the World of Chem...
Data Mining Dissertations and Adventures and Experiences in the World of Chem...
 
8 Information Architecture Better Practices
8 Information Architecture Better Practices8 Information Architecture Better Practices
8 Information Architecture Better Practices
 
Thought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchThought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered Search
 
07-Classification.pptx
07-Classification.pptx07-Classification.pptx
07-Classification.pptx
 
NYSAFLT summer Theisen tech workshop handout 2011 21st century tools to teac...
NYSAFLT summer Theisen tech  workshop handout 2011 21st century tools to teac...NYSAFLT summer Theisen tech  workshop handout 2011 21st century tools to teac...
NYSAFLT summer Theisen tech workshop handout 2011 21st century tools to teac...
 
The Magical Art of Extracting Meaning From Data
The Magical Art of Extracting Meaning From DataThe Magical Art of Extracting Meaning From Data
The Magical Art of Extracting Meaning From Data
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
 
Introduction to Recommendation System
Introduction to Recommendation SystemIntroduction to Recommendation System
Introduction to Recommendation System
 
Quality Research
Quality Research Quality Research
Quality Research
 
NYSALT Technology keynote summer 2011
NYSALT Technology keynote summer 2011 NYSALT Technology keynote summer 2011
NYSALT Technology keynote summer 2011
 
Kampmeier ecn 2012
Kampmeier ecn 2012Kampmeier ecn 2012
Kampmeier ecn 2012
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
MongoDB World 2019 - A Complete Methodology to Data Modeling for MongoDB
MongoDB World 2019 - A Complete Methodology to Data Modeling for MongoDBMongoDB World 2019 - A Complete Methodology to Data Modeling for MongoDB
MongoDB World 2019 - A Complete Methodology to Data Modeling for MongoDB
 
Write a better FM
Write a better FMWrite a better FM
Write a better FM
 
!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx
!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx
!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx
 
Product Reviews - Topic Mining & Auto Topic Classification System
Product Reviews - Topic Mining & Auto Topic Classification SystemProduct Reviews - Topic Mining & Auto Topic Classification System
Product Reviews - Topic Mining & Auto Topic Classification System
 
VSSML17 L3. Clusters and Anomaly Detection
VSSML17 L3. Clusters and Anomaly DetectionVSSML17 L3. Clusters and Anomaly Detection
VSSML17 L3. Clusters and Anomaly Detection
 

More from Catalina Arango

More from Catalina Arango (9)

Intelligent Enterprise: How to Create the Customer Experience of the Future? ...
Intelligent Enterprise: How to Create the Customer Experience of the Future? ...Intelligent Enterprise: How to Create the Customer Experience of the Future? ...
Intelligent Enterprise: How to Create the Customer Experience of the Future? ...
 
It's All About the Data - Tia Dubuisson
It's All About the Data - Tia DubuissonIt's All About the Data - Tia Dubuisson
It's All About the Data - Tia Dubuisson
 
Transforming an Organization using Data Science - Successes & Challenges - Da...
Transforming an Organization using Data Science - Successes & Challenges - Da...Transforming an Organization using Data Science - Successes & Challenges - Da...
Transforming an Organization using Data Science - Successes & Challenges - Da...
 
Digital Data Transformation
Digital Data TransformationDigital Data Transformation
Digital Data Transformation
 
Let's paint a Picasso - A Look at Generative Adversarial Networks (GAN) and i...
Let's paint a Picasso - A Look at Generative Adversarial Networks (GAN) and i...Let's paint a Picasso - A Look at Generative Adversarial Networks (GAN) and i...
Let's paint a Picasso - A Look at Generative Adversarial Networks (GAN) and i...
 
The right JOIN: Merchants + Data
The right JOIN: Merchants + DataThe right JOIN: Merchants + Data
The right JOIN: Merchants + Data
 
Biomedical Signal Extraction for Computer-assisted Clinical Decision Making -...
Biomedical Signal Extraction for Computer-assisted Clinical Decision Making -...Biomedical Signal Extraction for Computer-assisted Clinical Decision Making -...
Biomedical Signal Extraction for Computer-assisted Clinical Decision Making -...
 
The Power of Topology - Colleen Farrelly - WiDS Miami 2018
The Power of Topology - Colleen Farrelly - WiDS Miami 2018The Power of Topology - Colleen Farrelly - WiDS Miami 2018
The Power of Topology - Colleen Farrelly - WiDS Miami 2018
 
How AI is Transforming Industry - Dr. Monica DeZulueta - WiDS Miami 2018
How AI is Transforming Industry - Dr. Monica DeZulueta - WiDS Miami 2018How AI is Transforming Industry - Dr. Monica DeZulueta - WiDS Miami 2018
How AI is Transforming Industry - Dr. Monica DeZulueta - WiDS Miami 2018
 

Recently uploaded

如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
ju0dztxtn
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
pyhepag
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Stephen266013
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
dq9vz1isj
 
Abortion Clinic in Randfontein +27791653574 Randfontein WhatsApp Abortion Cli...
Abortion Clinic in Randfontein +27791653574 Randfontein WhatsApp Abortion Cli...Abortion Clinic in Randfontein +27791653574 Randfontein WhatsApp Abortion Cli...
Abortion Clinic in Randfontein +27791653574 Randfontein WhatsApp Abortion Cli...
mikehavy0
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
pwgnohujw
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
fztigerwe
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
zifhagzkk
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
pyhepag
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
pyhepag
 

Recently uploaded (20)

如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
The Significance of Transliteration Enhancing
The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancing
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 
MATERI MANAJEMEN OF PENYAKIT TETANUS.ppt
MATERI  MANAJEMEN OF PENYAKIT TETANUS.pptMATERI  MANAJEMEN OF PENYAKIT TETANUS.ppt
MATERI MANAJEMEN OF PENYAKIT TETANUS.ppt
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeral
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
Abortion Clinic in Randfontein +27791653574 Randfontein WhatsApp Abortion Cli...
Abortion Clinic in Randfontein +27791653574 Randfontein WhatsApp Abortion Cli...Abortion Clinic in Randfontein +27791653574 Randfontein WhatsApp Abortion Cli...
Abortion Clinic in Randfontein +27791653574 Randfontein WhatsApp Abortion Cli...
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
 
Genuine love spell caster )! ,+27834335081) Ex lover back permanently in At...
Genuine love spell caster )! ,+27834335081)   Ex lover back permanently in At...Genuine love spell caster )! ,+27834335081)   Ex lover back permanently in At...
Genuine love spell caster )! ,+27834335081) Ex lover back permanently in At...
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 

Natural Language Processing - Principles and Practice - Gracie Diaz

  • 1. Demo: Topic Modeling in Natural Language Processing WiDS Miami | March 4, 2019 Gracie Diaz | Royal Caribbean Cruises Ltd.
  • 2. Agenda • Data exploration: Amazon.com food product reviews • Topic modeling via Latent Dirichlet Allocation (LDA) • Preprocessing and its role in LDA topic quality • Closing thoughts: Suggested LDA best-practices
  • 3. Agenda • Data exploration: Amazon.com food product reviews • Topic modeling via Latent Dirichlet Allocation (LDA) • Preprocessing and its role in LDA topic quality • Closing thoughts: Suggested LDA best-practices
  • 4. Source: Amazon Fine Food Reviews, “568,454 food reviews Amazon users left up to October 2012”, https://www.kaggle.com/snap/amazon-fine-food-reviews Data: 500k+ food product reviews from Amazon.com
  • 5. Topics we can see: “taffy”
  • 6. Topics we can see: “dog food”
  • 7. Topics we can see: …… (other)
  • 8. Agenda • Data exploration: Amazon.com food product reviews • Topic modeling via Latent Dirichlet Allocation (LDA) • Preprocessing and its role in LDA topic quality • Closing thoughts: Suggested LDA best-practices
  • 9. terms termstopics topics documents documents (sparse) matrix of counts ≈ 𝑋 ≈ 𝑊𝐻 (Non-negative Matrix Factorization – NMF) Topic modeling via Latent Dirichlet Allocation (LDA)
  • 10. documents documents → → → terms ‘flavor’, ‘great’, ‘are’, ‘quality’ (common terms, likely evenly-distributed across documents about food) ‘saltwater’ [taffy] (term more likely in documents specifically about taffy) ‘canned’ [dog food] (term more likely in documents specifically about dog food) ‘salted’ [peanuts] (term more likely in documents specifically about peanuts) sparse matrix is result of vectorization of documents’ preprocessed terms (more on this next!) (sparse) matrix of counts Topic “factor” in document-term relationships
  • 11. terms documents topics documents 0 (taffy) 1 (dog food) 2 (other) terms ‘flavor’, ‘great’, ‘are’, ‘quality’ (common terms, likely evenly-distributed across topics or documents about food) ‘saltwater’ [taffy] (term more likely in topics about taffy) ‘canned’ [dog food] (term more likely in topics about dog food) ‘salted’ [peanuts] (term more likely in topics about peanuts) → → → Topic “factor” in document-term relationships
  • 12. Agenda • Data exploration: Amazon.com food product reviews • Topic modeling via Latent Dirichlet Allocation (LDA) • Preprocessing terms and its role in LDA topic quality • Closing thoughts: Suggested LDA best-practices
  • 13. What influences LDA models’ topic “quality”? • Number of topics selected – all high quality topics, or only some? • Random seed – results are vulnerable to local optima • Term preprocessing – steps might include: • standardization • lemmatization • parts-of-speech (POS) filtering • n-grams
  • 14. How preprocessing impacts topic results (Jupyter notebook) tokenized only standardized lemmatized/POS filtered n-grams
  • 15. Demo agenda • Data exploration: Amazon.com food product reviews • Topic modeling via Latent Dirichlet Allocation (LDA) • Preprocessing and its role in LDA topic quality • Closing thoughts: Suggested LDA best-practices
  • 16. Suggested LDA best-practices • Go “Big” or Go Home! • Number of documents (minimize pre-filtering) • Number of topics (don’t be afraid of going big, like 50+, 100+) • Model passes/iterations (check docs) • Modify recipe to taste • Change preprocessing steps or order • Adjust parameters
  • 19. Supervised vs. Unsupervised Algorithms Image sources: https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d Machine Learning Supervised • Labeled data • Predict outcome Unsupervised • No labels • Find hidden structure Latent Dirichlet Allocation (LDA)
  • 20. Topic modeling: An unsupervised problem Corpus Topics Doc Doc Term Term Term Term-Topic Relationship Docs Generically: observations, records, rows dataset fitted model
  • 21. terms terms topics documents documents ≈ The Topic “Factor” 0 (taffy) 1 (dog food) 2 (other) Common terms, likely evenly- distributed across topics OR documents saltwater taffy (more likely in topic 0) canned dog food (more likely in topic 1) salted peanuts (more likely in topic 2)
  • 22. terms termstopics topics documents documents (sparse) matrix of counts ≈ 𝑋 ≈ 𝑊𝐻 (Non-negative Matrix Factorization – NMF) set of documents --> corpus set of terms --> dictionary sparse matrix is result of “vectorization” of documents’ term counts Model is term-document matrix factored into:  Document-Topic weights, and  Topic-Term weights Structure of LDA document-term-topic relationships
  • 23. terms termstopics topics documents documents (sparse) matrix of counts ≈ 𝑋 ≈ 𝑊𝐻 (Non-negative Matrix Factorization – NMF) Topic modeling via Latent Dirichlet Allocation (LDA)
  • 25. Players in topic modeling • term – a word or word token (could be n-gram, lemma, or stem) • Examples: “plane”, “very”, “be_available”, “fault_line” • In machine learning this would be: one of the features on which a model trains • document – a collection of words or sentences that has a real-world purpose/context for its existence • Examples: an email, a book or one of its chapters, a social media post, a journalistic article, etc. • In machine learning this would be: a single piece of data, like a “row”, “record”, “observation” • corpus – the set of documents parsed for analysis (plural: corpora) • Examples: correspondence, book collection, articles • In machine learning this would be: the overall dataset of rows and observations • dictionary – the numeric id of each of the corpus’s preprocessed “words” • Example: for “I am orange, but I am not purple” words are: 1: I, 2: am, 3: orange, 4: but, 5: not, 6: purple • In machine learning this would be: the set of features • vectorization – the stats of the “words” in the documents (i.e. term frequency) • ”Document 001 has 2 instances of the word identified by the number 2” • In machine learning this would be: how each row stacks up in terms of each feature
• 26. How preprocessing impacts topic results. Pipeline stages: all tokens → standardization → lemmas/POS → n-grams.
• 27. How preprocessing impacts topic results (stage: all tokens). Dictionary built from raw tokens, with case and punctuation preserved: {'I': 0, 'Labrador': 1, 'My': 2, 'The': 3, 'Vitality': 4, 'a': 5, 'all': 6, 'and': 7, 'appreciates': 8, 'be': 9, 'better': 10, 'better.': 11, 'bought': 12, 'canned': 13, 'dog': 14, 'finicky': 15, 'food': 16, 'found': 17, 'good': 18, 'have': 19, 'is': 20, 'it': 21, 'like': 22, 'looks': 23, 'meat': 24, 'more': 25, 'most.': 26, 'of': 27, 'processed': 28, 'product': 29, 'products': 30, 'quality.': 31, 'several': 32, 'she': 33, 'smells': 34, 'stew': 35, 'than': 36, 'the': 37, 'them': 38, 'this': 39, 'to': 40, '"Jumbo".': 41, 'Jumbo': 42, 'Not': 43, 'Peanuts...the': 44, 'Product': 45, 'Salted': 46, 'actually': 47, 'an': 48, 'arrived': 49, 'as': 50, 'error': 51, 'if': 52, 'intended': 53, 'labeled': 54, 'or': 55, 'peanuts': 56, 'represent': 57, 'sized': 58, 'small': 59, 'sure': 60, 'unsalted.': 61, 'vendor': 62, 'was': 63, 'were': 64, '"The': 65, '-': 66, 'And': 67, 'Brother': 68, 'C.S.': 69, 'Edmund': 70, 'Filberts.': 71, 'If': 72, 'It': 73, "Lewis'": 74, 'Lion,': 75, 'Sisters': 76, 'This': 77, 'Wardrobe"': 78, 'Witch,': 79, 'Witch.': 80, 'are': 81, 'around': 82, 'been': 83, 'case': 84, 'centuries.': 85, 'chewy,': 86, 'citrus': 87, 'coated': 88, 'confection': 89, 'cut': 90, 'familiar': 91, 'few': 92, 'flavorful.': 93, 'gelatin': 94, 'has': 95, 'heaven.': 96, 'highly': 97, 'his': 98, 'in': 99, 'into': 100, . . .
• 29. How preprocessing impacts topic results (stage: standardization). Dictionary after lowercasing and stripping punctuation, so e.g. 'better' and 'better.' now share one id: {'all': 0, 'and': 1, 'appreciates': 2, 'be': 3, 'better': 4, 'bought': 5, 'canned': 6, 'dog': 7, 'finicky': 8, 'food': 9, 'found': 10, 'good': 11, 'have': 12, 'is': 13, 'it': 14, 'labrador': 15, 'like': 16, 'looks': 17, 'meat': 18, 'more': 19, 'most': 20, 'my': 21, 'of': 22, 'processed': 23, 'product': 24, 'products': 25, 'quality': 26, 'several': 27, 'she': 28, 'smells': 29, 'stew': 30, 'than': 31, 'the': 32, 'them': 33, 'this': 34, 'to': 35, 'vitality': 36, 'actually': 37, 'an': 38, 'arrived': 39, 'as': 40, 'error': 41, 'if': 42, 'intended': 43, 'jumbo': 44, 'labeled': 45, 'not': 46, 'or': 47, 'peanuts': 48, 'represent': 49, 'salted': 50, 'sized': 51, 'small': 52, 'sure': 53, 'unsalted': 54, 'vendor': 55, 'was': 56, 'were': 57, 'are': 58, 'around': 59, 'been': 60, 'brother': 61, 'case': 62, 'centuries': 63, 'chewy': 64, 'citrus': 65, 'coated': 66, 'confection': 67, 'cut': 68, 'edmund': 69, 'familiar': 70, 'few': 71, 'filberts': 72, 'flavorful': 73, 'gelatin': 74, 'has': 75, 'heaven': 76, 'highly': 77, 'his': 78, 'in': 79, 'into': 80, 'lewis': 81, 'liberally': 82, 'light': 83, 'lion': 84, 'mouthful': 85, 'nuts': 86, 'out': 87, 'pillowy': 88, 'powdered': 89, 'recommend': 90, 'seduces': 91, 'selling': 92, 'sisters': 93, 'squares': 94, 'story': 95, 'sugar': 96, 'that': 97, 'then': 98, 'tiny': 99, 'too': 100
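A minimal sketch of the standardization step: lowercase every token and strip punctuation, so that e.g. 'better' and 'better.' collapse into one dictionary entry. The token list below is a hypothetical fragment (not the actual review text), and real pipelines such as gensim's `simple_preprocess` handle more edge cases:

```python
import re

# Hypothetical raw tokens: note the 'better.' / 'better' pair
raw = ['The', 'food', 'is', 'better.', 'better', '"Jumbo".', 'Salted', 'peanuts']

# Lowercase, then drop any character that is not a lowercase letter
standardized = [re.sub(r'[^a-z]', '', tok.lower()) for tok in raw]
standardized = [tok for tok in standardized if tok]  # drop tokens emptied by stripping

print(standardized)
# ['the', 'food', 'is', 'better', 'better', 'jumbo', 'salted', 'peanuts']
```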
• 30. How preprocessing impacts topic results (stage: lemmas/POS). Dictionary after lemmatization (spaCy-style: pronouns collapse to '-PRON-', 'peanuts' → 'peanut', 'arrived' → 'arrive'): {'-PRON-': 0, 'appreciate': 1, 'be': 2, 'better': 3, 'buy': 4, 'can': 5, 'dog': 6, 'find': 7, 'finicky': 8, 'food': 9, 'good': 10, 'have': 11, 'labrador': 12, 'look': 13, 'meat': 14, 'more': 15, 'most': 16, 'process': 17, 'product': 18, 'quality': 19, 'several': 20, 'smell': 21, 'stew': 22, 'vitality': 23, 'actually': 24, 'arrive': 25, 'error': 26, 'intend': 27, 'jumbo': 28, 'label': 29, 'not': 30, 'peanut': 31, 'represent': 32, 'salt': 33, 'sized': 34, 'small': 35, 'sure': 36, 'unsalted': 37, 'vendor': 38, 'brother': 39, 'case': 40, 'century': 41, 'chewy': 42, 'citrus': 43, 'coat': 44, 'confection': 45, 'cut': 46, 'edmund': 47, 'familiar': 48, 'few': 49, 'filbert': 50, 'flavorful': 51, 'gelatin': 52, 'highly': 53, 'lewi': 54, 'liberally': 55, 'light': 56, 'lion': 57, 'mouthful': 58, 'nut': 59, 'pillowy': 60, 'powdered': 61, 'recommend': 62, 'seduce': 63, 'sell': 64, 'sister': 65, 'square': 66, 'story': 67, 'sugar': 68, 'that': 69, 'then': 70, 'tiny': 71, 'too': 72, 'treat': 73, 'very': 74, 'wardrobe': 75, 'witch': 76, 'yummy': 77, 'addition': 78, 'beer': 79, 'believe': 80, 'cherry': 81, 'extract': 82, 'flavor': 83, 'get': 84, 'ingredient': 85, 'make': 86, 'medicinal': 87, 'order': 88, 'robitussin': 89, 'root': 90, 'secret': 91, 'soda': 92, 'which': 93, 'assortment': 94, 'deal': 95, 'delivery': 96, 'great': 97, 'lover': 98, 'price': 99, 'quick': 100
• 31. How preprocessing impacts topic results (stage: n-grams). Dictionary after n-gram detection: frequently adjacent lemmas are joined into single tokens, e.g. 'flavor_be' and 'be_very': {'-PRON-': 0, 'appreciate': 1, 'be': 2, 'better': 3, 'buy': 4, 'can': 5, 'dog': 6, 'find': 7, 'finicky': 8, 'food': 9, 'good': 10, 'have': 11, 'labrador': 12, 'look': 13, 'meat': 14, 'more': 15, 'most': 16, 'process': 17, 'product': 18, 'quality': 19, 'several': 20, 'smell': 21, 'stew': 22, 'vitality': 23, 'actually': 24, 'arrive': 25, 'error': 26, 'intend': 27, 'jumbo': 28, 'label': 29, 'not': 30, 'peanut': 31, 'represent': 32, 'salt': 33, 'sized': 34, 'small': 35, 'sure': 36, 'unsalted': 37, 'vendor': 38, 'brother': 39, 'case': 40, 'century': 41, 'chewy': 42, 'citrus': 43, 'coat': 44, 'confection': 45, 'cut': 46, 'edmund': 47, 'familiar': 48, 'few': 49, 'filbert': 50, 'flavorful': 51, 'gelatin': 52, 'highly': 53, 'lewi': 54, 'liberally': 55, 'light': 56, 'lion': 57, 'mouthful': 58, 'nut': 59, 'pillowy': 60, 'powdered': 61, 'recommend': 62, 'seduce': 63, 'sell': 64, 'sister': 65, 'square': 66, 'story': 67, 'sugar': 68, 'that': 69, 'then': 70, 'tiny': 71, 'too': 72, 'treat': 73, 'very': 74, 'wardrobe': 75, 'witch': 76, 'yummy': 77, 'addition': 78, 'beer': 79, 'believe': 80, 'cherry': 81, 'extract': 82, 'flavor_be': 83, 'get': 84, 'ingredient': 85, 'make': 86, 'medicinal': 87, 'order': 88, 'robitussin': 89, 'root': 90, 'secret': 91, 'soda': 92, 'which': 93, 'assortment': 94, 'be_very': 95, 'deal': 96, 'delivery': 97, 'great': 98, 'lover': 99, 'price': 100
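The n-gram stage joins frequently co-occurring lemma pairs (e.g. 'be_very', 'flavor_be') into single tokens. gensim's `Phrases` model does this with a collocation score; here is a minimal frequency-threshold sketch over made-up lemmatized documents, not the actual review corpus:

```python
from collections import Counter

# Toy lemmatized documents (illustrative only)
docs = [
    ['this', 'soda', 'be', 'very', 'yummy'],
    ['the', 'taffy', 'be', 'very', 'chewy'],
    ['the', 'flavor', 'be', 'very', 'good'],
]

# Count adjacent token pairs across all documents
pair_counts = Counter((a, b) for doc in docs for a, b in zip(doc, doc[1:]))

min_count = 2  # merge pairs seen at least this often

def to_ngrams(doc):
    """Rewrite a document, joining frequent adjacent pairs with '_'."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and pair_counts[(doc[i], doc[i + 1])] >= min_count:
            out.append(doc[i] + '_' + doc[i + 1])
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out

print(to_ngrams(docs[0]))  # ['this', 'soda', 'be_very', 'yummy']
```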
• 32. How preprocessing impacts topic results (recap). Pipeline stages: all tokens → standardization → lemmas/POS → n-grams.