2. Agenda - Text Summarization and Visualization
• Introduction
• Data and preprocessing
• What is LDA topic modelling
• Why do you need one?
• How to build it?
• Q&A
DOC ID / September 28, 2019 / © 2019 IBM Corporation
3. IBM Cloud Registration
Please register to the cloud environment needed for
the workshop:
https://ibm.biz/BdzAmY
What Will You Learn Today?
• Quickly summarize text from documents and news feeds.
• Build topic models on the text to extract important topics.
• Create visualizations for a better understanding of the data.
• Interpret the summary and visualizations of the data.
• Analyze the text for further processing to generate recommendations or make informed decisions.
8. Data Pre-processing
• NLP Pre-processing
• Tokenization
• Remove stop words (e.g. the, and, of, do, because, since, so, but, or, when, in, an)
• Lemmatization
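The pre-processing steps above can be sketched in plain Python. This is a minimal illustration: the stop-word list and lemma lookup below are toy stand-ins for library resources such as NLTK's stopwords corpus and a WordNet lemmatizer.

```python
import re

# Toy stop-word list; real pipelines use a library-provided list.
STOP_WORDS = {"the", "and", "of", "do", "because", "since", "so",
              "but", "or", "when", "in", "an", "a", "over"}

# Toy lemma lookup standing in for a real lemmatizer.
LEMMAS = {"jumps": "jump", "running": "run", "better": "good"}

def preprocess(text):
    # Tokenization: lowercase, then split on non-alphabetic characters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stop-word removal followed by lemmatization via the lookup table.
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The quick brown fox jumps over the lazy dog."))
# → ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
```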
9. Stop Words
Stop words are words that are filtered out before further processing of text, since they contribute little to the overall meaning: they are generally the most common words in a language.
For instance, "the," "and," and "a," while all required words in a particular passage, don't generally contribute greatly to one's understanding of the content.
Example: The quick brown fox jumps over the lazy dog.
10. Normalization
Stemming
Stemming is the process of eliminating affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain the word stem.
running → run
Lemmatization
Lemmatization is related to stemming, but differs in that it captures the canonical form of a word based on the word's lemma.
better → good
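The contrast can be shown with a toy sketch: stemming chops suffixes mechanically (so it can produce non-words), while lemmatization maps a word to its dictionary form. The suffix rules and lemma table below are illustrative stand-ins for a real stemmer (e.g. Porter) and a real lexical resource (e.g. WordNet).

```python
# Naive suffix-stripping stemmer (illustrative only).
def stem(word):
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization via a toy lookup table mapping words to their lemmas.
LEMMAS = {"better": "good", "running": "run", "mice": "mouse"}

def lemmatize(word):
    return LEMMAS.get(word, word)

print(stem("running"))      # → run
print(lemmatize("better"))  # → good
```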
11. LDA Modeling
Latent Dirichlet Allocation
An unsupervised learning method that views documents as bags of words (i.e. word order does not matter).
It works by reverse-engineering an assumed generative process:
– each document was generated by first picking a set of topics, and then picking a set of words for each topic; LDA then tries to figure out, probabilistically, which word belongs to which topic.
– for each word w in document m, it assumes that w's current topic assignment is wrong but that every other word is assigned the correct topic.
– it then probabilistically reassigns word w to a topic based on two things:
• which topics are present in document m
• how many times word w has been assigned to each topic across all of the documents
LDA only looks at the words; everything else is a latent parameter.
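The resampling rule described above is collapsed Gibbs sampling. The following is a minimal pure-Python sketch of that idea (the toy corpus, hyperparameters, and iteration count are assumptions for illustration, not from the slides; a real workflow would use a library such as gensim):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler: for each word, pretend its current
    topic is wrong and redraw it from a weight combining (1) the topics
    already in its document and (2) how often the word takes each topic
    across the whole corpus."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    ndk = [[0] * n_topics for _ in docs]          # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                            # topic totals
    # Random initial topic assignment for every word token.
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    for m, d in enumerate(docs):
        for i, w in enumerate(d):
            k = z[m][i]
            ndk[m][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for m, d in enumerate(docs):
            for i, w in enumerate(d):
                k = z[m][i]
                # Remove this token's current (assumed wrong) assignment.
                ndk[m][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Weight per topic: doc mixture × word-topic frequency.
                weights = [(ndk[m][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[m][i] = k
                ndk[m][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, ndk

docs = [["apple", "banana", "apple", "fruit"],
        ["stock", "market", "stock", "price"],
        ["fruit", "banana", "market", "apple"]]
assignments, doc_topic = lda_gibbs(docs, n_topics=2)
print(doc_topic)  # per-document topic counts
```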
12. LDA tries to find the "recipe" for each topic
E.g. Topic 1 = 50% + 30% + 20% (a weighted mixture of words)
13. [Diagram: Documents 1–3, each split into Sentences 1–3, mapped against Topics 1–10]
Create a bag of words of all sentences.
What are the most dominant topics for each document?
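The two steps above can be sketched in a few lines: pool each document's sentences into one bag of words, then read off the dominant topic. The topic weights below are made-up placeholders for what a fitted LDA model would produce.

```python
from collections import Counter

def bag_of_words(sentences):
    # Pool all sentences of a document into a single word-count bag.
    bag = Counter()
    for s in sentences:
        bag.update(s.lower().split())
    return bag

# Hypothetical per-document topic weights from a fitted LDA model.
doc_topic_weights = {
    "Document 1": {"Topic 3": 0.6, "Topic 6": 0.3, "Topic 2": 0.1},
    "Document 2": {"Topic 8": 0.7, "Topic 1": 0.2, "Topic 4": 0.1},
    "Document 3": {"Topic 10": 0.5, "Topic 2": 0.4, "Topic 7": 0.1},
}
# The dominant topic is simply the highest-weighted one per document.
dominant = {doc: max(w, key=w.get) for doc, w in doc_topic_weights.items()}
print(dominant)
# → {'Document 1': 'Topic 3', 'Document 2': 'Topic 8', 'Document 3': 'Topic 10'}
```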
15. Text Summarization and Visualization – Code Pattern
BUILD:
https://developer.ibm.com/patterns/text-summarization-topic-modelling-using-watson-studio-watson-nlu/
16. References
https://www.youtube.com/watch?v=3mHy4OSyRf0
https://towardsdatascience.com/perplexity-intuition-and-derivation-105dd481c8f3