Text Summarization and Visualization
using Watson Studio
Binu Midhun
Developer Advocate
@Binu_Midhun
Agenda - Text Summarization and Visualization
• Introduction
• Data and preprocessing
• What is LDA topic modelling?
• Why do you need one?
• How to build it?
• Q&A
IBM Cloud Registration
Please register for the cloud environment needed for
the workshop:
https://ibm.biz/BdzAmY
Text Summarization and Visualization – Why?
What will you learn today?
• Quickly summarize the text from documents & news feeds.
• Build topic models on the text to extract important topics.
• Create visualizations for a better understanding of the data.
• Interpret the summary and visualization of the data.
• Analyze the text for further processing to generate recommendations or make informed decisions.
Data Pre-processing
Acquiring the data → Data Preparation → Feature Engineering
Data Pre-processing
• NLP Pre-processing
• Tokenization
• Remove stop words
• Lemmatization
Example stop words: the, and, of, do, because, since, so, but, or, when, in, an
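A minimal tokenization sketch using NLTK (an assumption; the workshop notebook may use a different tokenizer). Stop-word removal and lemmatization are illustrated on the following slides:

```python
from nltk.tokenize import word_tokenize
# nltk.download("punkt")  # one-time download of the tokenizer models

text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text.lower())
print(tokens)
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```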
Stop Words
Stop words are words that are filtered out before further processing of text, since they contribute little to the overall meaning; they are generally the most common words in a language.
For instance, "the," "and," and "a," while grammatically necessary in a passage, don't generally contribute much to one's understanding of its content.
The quick brown fox jumps over the lazy dog.
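A sketch of stop-word removal on the example sentence, using NLTK's built-in English stop-word list (an assumption; any stop-word list could be substituted):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")  # one-time download of the stop-word lists

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("The quick brown fox jumps over the lazy dog.".lower())
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
print(filtered)
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```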
Normalization
Stemming
Stemming is the process of eliminating affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain a word stem.
running → run
Lemmatization
Lemmatization is related to stemming, but differs in that it is able to capture canonical forms based on a word's lemma.
better → good
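A sketch contrasting the two, using NLTK's PorterStemmer and WordNetLemmatizer (the lemmatizer needs the WordNet corpus, and a part-of-speech hint to map "better" to "good"):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # one-time download for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                   # run
print(lemmatizer.lemmatize("better", pos="a"))   # good ("a" = adjective)
print(lemmatizer.lemmatize("running", pos="v"))  # run
```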
LDA Modeling
Latent Dirichlet Allocation
An unsupervised learning technique that views documents as bags of words (i.e., word order does not matter).
It works by reverse engineering, building on a generative assumption:
– each document was generated by first picking a set of topics, and then picking a set of words for each topic; LDA then tries to figure out, probabilistically, which word belongs to which topic.
– for each word w in document m, it assumes that w's current topic assignment is wrong but that every other word is assigned the correct topic.
– it then probabilistically reassigns word w to a topic based on two things:
• which topics are present in document m
• how many times word w has been assigned to that particular topic across all of the documents
Only the words are observed; the rest are latent parameters.
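A minimal sketch of training an LDA model with gensim (an assumption about the library; details such as the sample documents and the number of topics are placeholders, and the code pattern's notebook may differ):

```python
from gensim import corpora, models

# tokenized_docs: pre-processed token lists, one per document (toy example)
tokenized_docs = [
    ["cloud", "data", "model", "topic"],
    ["watson", "studio", "notebook", "data"],
    ["summary", "text", "topic", "model"],
]

dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]  # bag of words per document

lda = models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=3,      # assumed; tune for your corpus
    passes=10,
    random_state=42,
)
```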
LDA is trying to find the recipe for each topic: the weighted mix of words that makes it up.
E.g. Topic 1 = 50% of one word + 30% of another + 20% of a third
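The learned "recipe" can be inspected as each topic's word-probability mix; with the gensim model sketched above, for example:

```python
# Top three words (and their weights) for topic 0
for word, weight in lda.show_topic(0, topn=3):
    print(f"{word}: {weight:.0%}")
# e.g. a topic ≈ 50% one word + 30% another + 20% a third
```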
(Diagram: Documents 1–3, each consisting of several sentences, mapped to topics 1–10.)
Create a bag of words of all sentences.
What are the most dominant topics for each document?
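Continuing the gensim sketch above, the dominant topics for each document can be read off the per-document topic distribution:

```python
# For each document's bag of words, sort its topics by probability
for i, bow in enumerate(corpus):
    doc_topics = sorted(lda.get_document_topics(bow), key=lambda t: t[1], reverse=True)
    topic_id, prob = doc_topics[0]
    print(f"Document {i + 1}: dominant topic {topic_id} ({prob:.0%})")
```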
Results
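One common way to visualize the resulting topics is pyLDAvis (an assumption; the code pattern may use other visualizations such as word clouds):

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # older pyLDAvis versions: pyLDAvis.gensim

# Build an interactive topic map from the model, corpus, and dictionary above
vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")  # open in a browser, or display inline in a notebook
```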
Text Summarization and Visualization – Code Pattern
BUILD
https://developer.ibm.com/patterns/text-summarization-topic-modelling-using-watson-studio-watson-nlu/
References
https://www.youtube.com/watch?v=3mHy4OSyRf0
https://towardsdatascience.com/perplexity-intuition-and-derivation-105dd481c8f3
