Text Summarization and Visualization
Using Watson Studio
Binu Midhun
Developer Advocate
@Binu_Midhun
Agenda - Text Summarization and Visualization
• Introduction
• Data and preprocessing
• What is LDA topic modelling?
• Why do you need one?
• How to build it?
• Q&A
DOC ID / September 28, 2019 / © 2019 IBM Corporation
IBM Cloud Registration
Please register for the IBM Cloud environment needed for
the workshop:
https://ibm.biz/BdzAmY
Text Summarization and Visualization – Why?
What will you learn today?
• Quickly summarize text from documents and news feeds.
• Apply topic modelling to the text to extract important topics.
• Create visualizations for a better understanding of the data.
• Interpret the summary and visualizations of the data.
• Analyze the text for further processing, to generate recommendations or make informed decisions.
Data Pre-processing
Stages: acquiring the data, data preparation, feature engineering
• NLP pre-processing
• Tokenization
• Remove stop words
• Lemmatization
Example stop words: the, and, of, do, because, since, so, but, or, when, in, an
Stop Words
Stop words are words that are filtered out before further processing of text. They contribute little
to the overall meaning, since they are generally the most common words in a language.
For instance, "the", "and", and "a" are all required in a given passage, but they generally contribute
little to one's understanding of its content.
Example: The quick brown fox jumps over the lazy dog.
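A minimal sketch of stop-word filtering on the example sentence follows. The stop-word set is just the short list from these slides, and the helper names `tokenize` and `remove_stop_words` are hypothetical; a real notebook would more likely rely on a library-provided list.

```python
import re

# Stop words taken from the slides; a real pipeline would use a fuller list.
STOP_WORDS = {"the", "and", "of", "do", "because", "since", "so",
              "but", "or", "when", "in", "an", "a"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Keep only the tokens that are not in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]

sentence = "The quick brown fox jumps over the lazy dog."
filtered = remove_stop_words(tokenize(sentence))
print(filtered)
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```

Both occurrences of "the" are dropped, while the content-bearing words survive in their original order.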
Normalization
Stemming
Stemming is the process of removing affixes (suffixes, prefixes, infixes, circumfixes) from a word
in order to obtain a word stem.
running → run
Lemmatization
Lemmatization is related to stemming, but it maps a word to its canonical dictionary form, the lemma,
rather than just stripping affixes.
better → good
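The difference between the two can be sketched with a toy example. The suffix list and the lemma table below are assumptions made up for illustration, and `stem` and `lemmatize` are hypothetical helpers, not the API of any NLP library.

```python
# Toy suffix list and lemma table, both assumptions for illustration only.
SUFFIXES = ("ing", "ed", "s")
LEMMAS = {"better": "good", "best": "good", "ran": "run", "mice": "mouse"}

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stemmed = word[: -len(suffix)]
            # "runn" -> "run": undo the consonant doubling added with the suffix.
            if stemmed[-1] == stemmed[-2] and stemmed[-1] not in "aeiouls":
                stemmed = stemmed[:-1]
            return stemmed
    return word

def lemmatize(word):
    """Look the word up in the lemma table; fall back to the word itself."""
    return LEMMAS.get(word, word)

print(stem("running"))      # run
print(lemmatize("better"))  # good
print(stem("better"))       # better
```

The last line shows why lemmatization needs vocabulary knowledge: no amount of affix stripping turns "better" into "good".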
LDA Modeling
Latent Dirichlet Allocation (LDA)
An unsupervised learning method that views documents as bags of words (i.e. word order does not matter).
It works by reverse engineering, starting from an assumption about how documents are produced:
– Each document was generated by picking a set of topics and then picking a set of words for each topic.
LDA then tries to work out, probabilistically, which topic each word belongs to.
– For each word w in document m, it assumes that the word's current topic is wrong but that every other
word is assigned the correct topic.
– It then probabilistically reassigns word w to a topic based on two things:
• which topics are in document m
• how many times word w has been assigned to a particular topic across all of the documents
Only the words are observed. Everything else is a latent parameter.
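The reassignment step described above corresponds to collapsed Gibbs sampling, which can be sketched in plain Python as follows. This is an illustrative sketch rather than the workshop's implementation: the function name `lda_gibbs`, the smoothing values `alpha` and `beta`, the iteration count, and the four tiny documents are all assumptions.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on already tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    # Count tables: topics per document, words per topic, total words per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start from a random topic for every word occurrence.
    assignments = []
    for m, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[m][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Treat this word's topic as wrong: remove it from the counts.
                z = assignments[m][i]
                doc_topic[m][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight = (topic k in document m) x (word w assigned to topic k).
                weights = [
                    (doc_topic[m][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[m][i] = z
                doc_topic[m][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Hypothetical toy corpus: two documents about fruit, two about football.
docs = [
    ["apple", "banana", "apple", "fruit"],
    ["banana", "fruit", "apple", "banana"],
    ["goal", "match", "team", "goal"],
    ["team", "match", "goal", "team"],
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for m, counts in enumerate(doc_topic):
    print(f"document {m}: topic counts {counts}")
```

The two factors in `weights` are the two bullet points above: how prevalent topic k already is in document m, and how often word w has been assigned to topic k across all documents.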
LDA is trying to find the recipe for each topic
E.g. Topic 1 = 50% + 30% + 20%
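One way to read a topic's "recipe" is as the share of each word among all words assigned to that topic. The sketch below assumes a dictionary of per-topic word counts; the words and counts are hypothetical stand-ins, since the slide does not name the words behind the percentages, and `topic_recipe` is a made-up helper.

```python
def topic_recipe(word_counts, top_n=3):
    """Return the top words of a topic with their share of the topic's total count."""
    total = sum(word_counts.values())
    ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(word, count / total) for word, count in ranked[:top_n]]

# Hypothetical counts of how often each word was assigned to topic 1.
topic_1_counts = {"cloud": 50, "data": 30, "model": 20}
recipe = topic_recipe(topic_1_counts)
print(" + ".join(f"{share:.0%} {word}" for word, share in recipe))
# 50% cloud + 30% data + 20% model
```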
[Figure: Documents 1 to 3 (each with Sentences 1 to 3) and Topics 1 to 10]
Create a bag of words of all sentences
What are the most dominant topics for each document?
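Both steps on this slide can be sketched briefly: building a bag of words from sentences, and picking the dominant topic of each document. The sentences and the per-document topic weights below are hypothetical placeholders for what a fitted topic model would return, and the helper names are assumptions.

```python
from collections import Counter

def bag_of_words(sentences):
    """Count word occurrences across all sentences, ignoring word order."""
    bag = Counter()
    for sentence in sentences:
        bag.update(sentence.lower().split())
    return bag

def dominant_topic(topic_weights):
    """Return the index of the topic with the largest weight."""
    return max(range(len(topic_weights)), key=lambda k: topic_weights[k])

sentences = ["the team scored a goal", "the goal won the match"]
print(bag_of_words(sentences))

# Hypothetical per-document topic weights from a fitted topic model.
doc_topic_weights = {
    "Document 1": [0.7, 0.2, 0.1],
    "Document 2": [0.1, 0.3, 0.6],
    "Document 3": [0.2, 0.5, 0.3],
}
for name, weights in doc_topic_weights.items():
    print(name, "-> Topic", dominant_topic(weights) + 1)
```

The dominant topic is simply the argmax of each document's topic weights; documents with several similar weights are better described by more than one topic.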
Results
Text Summarization and Visualization – Code Pattern
BUILD
https://developer.ibm.com/patterns/text-summarization-topic-modelling-using-watson-studio-watson-nlu/
References
https://www.youtube.com/watch?v=3mHy4OSyRf0
https://towardsdatascience.com/perplexity-intuition-and-derivation-105dd481c8f3
