SlideShare a Scribd company logo
1 of 19
Chapter 10
Text Mining
Dr. Steffen Herbold
herbold@cs.uni-goettingen.de
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
• Overview
• Challenges for Text Mining
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Example for Textual Data
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for
iPhone]
Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! https://t.co/eQC2NqdIil [Twitter for iPhone]
Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN rally. We need to elect
@KarinHousley to the U.S. Senate, and we need the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and
@PeteStauber in the U.S. House! [Twitter for iPhone]
Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He helped pass tax reform which lowered taxes for
EVERYONE! Nancy Pelosi is spending hundreds of thousands of dollars on his opponent because they both support a liberal
agenda of higher taxes and wasteful spending! [Twitter for iPhone]
Oct 4, 2018 02:29:27 PM “U.S. Stocks Widen Global Lead” https://t.co/Snhv08ulcO [Twitter for iPhone]
Oct 4, 2018 02:17:28 PM Statement on National Strategy for Counterterrorism: https://t.co/ajFBg9Elsj https://t.co/Qr56ycjMAV
[Twitter for iPhone]
Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone]
Oct 4, 2018 09:17:01 AM This is now the 7th. time the FBI has investigated Judge Kavanaugh. If we made it 100, it would still not
be good enough for the Obstructionist Democrats. [Twitter for iPhone]
Oct 4, 2018 09:01:13 AM RT @ChatByCC: While armed with the power of our Vote, we proudly & peacefully revolted Never
doubt this was an American revolution #MAGA… [Twitter for iPhone]
Oct 4, 2018 08:54:58 AM This is a very important time in our country. Due Process, Fairness and Common Sense are now on
trial! [Twitter for iPhone]
Oct 4, 2018 08:34:16 AM Our country’s great First Lady, Melania, is doing really well in Africa. The people love her, and she
loves them! It is a beautiful thing to see. [Twitter for iPhone]
Oct 4, 2018 07:16:41 AM The harsh and unfair treatment of Judge Brett Kavanaugh is having an incredible upward impact on
voters. The PEOPLE get it far better than the politicians. Most importantly, this great life cannot be ruined by mean & despicable
Democrats and totally uncorroborated allegations! [Twitter for iPhone]
How do you
analyze this?
Text Mining in General
• Inferring information from textual data
• Techniques for text mining as diverse as the texts themselves!
• The following demonstrates a relatively standard workflow for
processing text data
• Create corpus
• Pre-process contents
• Define bag-of-words
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Documents and Corpus
• Create documents to create a corpus
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for iPhone]
Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! https://t.co/eQC2NqdIil [Twitter for iPhone]
Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN rally. We need to elect @KarinHousley to the U.S. Senate, and
we need the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and @PeteStauber in the U.S. House! [Twitter for iPhone]
Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He helped pass tax reform which lowered taxes for EVERYONE! Nancy Pelosi is spending hundreds
of thousands of dollars on his opponent because they both support a liberal agenda of higher taxes and wasteful spending! [Twitter for iPhone]
Oct 4, 2018 02:29:27 PM “U.S. Stocks Widen Global Lead” https://t.co/Snhv08ulcO [Twitter for iPhone]
Oct 4, 2018 02:17:28 PM Statement on National Strategy for Counterterrorism: https://t.co/ajFBg9Elsj https://t.co/Qr56ycjMAV [Twitter for iPhone]
Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone]
Oct 4, 2018 09:17:01 AM This is now the 7th. time the FBI has investigated Judge Kavanaugh. If we made it 100, it would still not be good enough for the Obstructionist
Democrats. [Twitter for iPhone]
Each tweet a document
All tweets together the
corpus
Unstructured Data  Structured Data
• Identify relevant content
 Remove irrelevant content
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for iPhone]
Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! https://t.co/eQC2NqdIil [Twitter for iPhone]
Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN rally. We need to elect @KarinHousley to the U.S. Senate, and
we need the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and @PeteStauber in the U.S. House! [Twitter for iPhone]
Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He helped pass tax reform which lowered taxes for EVERYONE! Nancy Pelosi is spending hundreds
of thousands of dollars on his opponent because they both support a liberal agenda of higher taxes and wasteful spending! [Twitter for iPhone]
Oct 4, 2018 02:29:27 PM “U.S. Stocks Widen Global Lead” https://t.co/Snhv08ulcO [Twitter for iPhone]
Oct 4, 2018 02:17:28 PM Statement on National Strategy for Counterterrorism: https://t.co/ajFBg9Elsj https://t.co/Qr56ycjMAV [Twitter for iPhone]
Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone]
Oct 4, 2018 09:17:01 AM This is now the 7th. time the FBI has investigated Judge Kavanaugh. If we made it 100, it would still not be good enough for the Obstructionist
Democrats. [Twitter for iPhone]
Drop shortened links
Drop device
Drop timestamp
Punctuation and cases
• Usually do not carry information
 Can be removed
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
beautiful evening in rochester minnesota vote vote vote
thank you minnesota i love you
just made my second stop in minnesota for a make america great again rally we need to elect karinhousley to the us senate and we need the strong leadership of
tomemmer jason2cd jimhagedornmn and petestauber in the us house
congressman bishop is doing a great job he helped pass tax reform which lowered taxes for everyone nancy pelosi is spending hundreds of thousands of dollars on his
opponent because they both support a liberal agenda of higher taxes and wasteful spending
us stocks widen global lead
statement on national strategy for counterterrorism
working hard thank you
this is now the 7th time the fbi has investigated judge kavanaugh if we made it 100 it would still not be good enough for the obstructionist democrats
Stop words
• Most common words in a language (a, the, I, we, to, too, …)
 Can be removed
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
beautiful evening rochester minnesota vote vote vote
thank minnesota love
just made second stop minnesota make america great again rally need elect karinhousley senate need strong leadership tomemmer jason2cd jimhagedornmn petestauber
house
congressman bishop doing great job helped pass tax reform lowered taxes everyone nancy pelosi spending hundreds thousands dollars opponent both support liberal
agenda higher taxes wasteful spending
stocks widen global lead
statement national strategy counterterrorism
working hard thank
now 7th time fbi investigated judge kavanaugh made 100 would still good enough obstructionist democrats
Stemming and Lemmatization
• Stemming: reduce terms to common stem (spending  spend)
• Lemmatization: use common term for synonyms (better  good)
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
beautiful evening rochester minnesota vote vote vote
thank minnesota love
just make second stop minnesota make america great again rally need elect karinhousley senate need strong leadership tomemmer jason2cd jimhagedornmn petestauber
house
congressman bishop do good job help pass tax reform lower tax everyone nancy pelosi spend hundred thousand dollar opponent both support liberal agenda high tax
waste spend
stock wide global lead
statement national strategy counterterrorism
work hard thank
now 7th time fbi investigate judge kavanaugh make 100 would still good enough obstruct democrat
Bag-of-Words
beauti-
ful
evenin
g
roches
-ter
minne-
sota
vote thank love just make second stop amer-
ica
great …
1 1 1 1 3 0 0 0 0 0 0 0 0 …
0 0 0 1 0 1 1 0 0 0 0 0 0 …
0 0 0 1 0 0 0 0 2 1 1 1 1 …
0 0 0 0 0 0 0 0 0 0 0 0 0 …
0 0 0 0 0 0 0 0 0 0 0 0 0 …
0 0 0 0 0 1 0 0 0 0 0 0 0 …
0 0 0 0 0 0 0 0 1 0 0 0 0 …
… … … … … … … … … … … … … …
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Every term one
dimension
Number of occurences
in document
Inverse Document Frequency
• Absolute frequency of words problematic for discrimination
• Favors words that occur often and in many documents
• Similar to stop words
• Inverse document frequency for the uniqueness of terms
• 𝑖𝑑𝑓𝑖 = log
𝑁
𝐷𝑖
with 𝐷𝑖 the number of documents in which 𝑖 appears
• Inverse document frequency to weight terms by uniqueness
• 𝑡𝑓𝑖𝑑𝑓𝑖 = 𝑡𝑓𝑖 ⋅ 𝑖𝑑𝑓𝑖
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Downstream Analysis
• Bag of words base structure
• Allows many approaches for analysis
• Classification
• Usually requires manual labeling of data
• Clustering
• Sentiment analysis
• Based on word counts
• Information retrieval
• Identification of related documents
• Visualizations
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
• Overview
• Challenges for Text Mining
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
High Dimensional Data
• Bag of words gets high dimensional extremely fast
• Still over sixty different words after stemming/lemmatization in the
tweet example
• Can also have a huge amount of documents
• >39000 tweets by donald trump in total
• High dimension+many documents  very high runtime
• Requires
• Very efficient algorithms
• Often only possible through massive parallelization
• Naive Bayes is a popular choice
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Ambiguities
• Homonyms
• break (take a break, break something)
• Will all be grouped together by a bag of words
• Semantic changes in interpretation of the same sentences
• I hit the man with a stick. (Used a stick to hit the man)
• I hit the man with a stick (I hit the man who was holding a stick)
• Can often only be inferred from a greater context
 Often impossible to resolve and leads to noise in the analysis
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Capturing Syntax
• Bag of words ignores syntax
• Nine sentences with different meanings and the same words
1. Only he told his mistress that he loved her. (Nobody else did)
2. He only told his mistress that he loved her. (He didn't show her)
3. He told only his mistress that he loved her. (Kept it a secret from everyone else)
4. He told his only mistress that he loved her. (Stresses that he had only ONE!)
5. He told his mistress only that he loved her. (Didn't tell her anything else)
6. He told his mistress that only he loved her. ("I'm all you got, nobody else wants you.")
7. He told his mistress that he only loved her. (Not that he wanted to marry her.)
8. He told his mistress that he loved only her. (Yeah, don't they all...).
9. He told his mistress that he loved her only. (Similar to above one).
• Capturing word orders and other relationships greatly increases
the dimension
• E.g., n-grams
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
(bag of words over n-tuples)
And the list goes on
• Bad spelling
• Slang
• Synonyms not captured by lemmatization
• Encodings and special characters
• …
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
• Overview
• Challenges for Text Mining
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Summary
• Text mining is the analysis of textual data
• Requires imposing a structure upon the unstructured texts
• Bag of words
• There are some standard techniques
• Removing punctuation, cases, stopwords
• Stemming/Lemmatization
• Often also removing numbers and special characters
 Depends on context!
• Many problems due to noise in the textual data
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science

More Related Content

Similar to 10-Text-Mining.pptx

Similar to 10-Text-Mining.pptx (20)

What do Twitter conversations tell us about petitioning?
What do Twitter conversations tell us about petitioning?What do Twitter conversations tell us about petitioning?
What do Twitter conversations tell us about petitioning?
 
The language of social media
The language of social mediaThe language of social media
The language of social media
 
2007 open everything at gnomedex 4.4
2007 open everything at gnomedex 4.42007 open everything at gnomedex 4.4
2007 open everything at gnomedex 4.4
 
Free Essay On Swot Analysis
Free Essay On Swot AnalysisFree Essay On Swot Analysis
Free Essay On Swot Analysis
 
Steps For Writing A Persuasive Essay. Online assignment writing service.
Steps For Writing A Persuasive Essay. Online assignment writing service.Steps For Writing A Persuasive Essay. Online assignment writing service.
Steps For Writing A Persuasive Essay. Online assignment writing service.
 
Social Media: New Tool for Government Affairs
Social Media:  New Tool for Government AffairsSocial Media:  New Tool for Government Affairs
Social Media: New Tool for Government Affairs
 
Good Quality Writing Paper
Good Quality Writing PaperGood Quality Writing Paper
Good Quality Writing Paper
 
002 Essay Example Purpose Of An Thatsnotus
002 Essay Example Purpose Of An Thatsnotus002 Essay Example Purpose Of An Thatsnotus
002 Essay Example Purpose Of An Thatsnotus
 
Essay Kazakhstan 2050. PDF Strategy quot;Kazakhstan 2050quot;: Basis of the ...
Essay Kazakhstan 2050. PDF Strategy quot;Kazakhstan  2050quot;: Basis of the ...Essay Kazakhstan 2050. PDF Strategy quot;Kazakhstan  2050quot;: Basis of the ...
Essay Kazakhstan 2050. PDF Strategy quot;Kazakhstan 2050quot;: Basis of the ...
 
Narrative Essay Narrative Writing
Narrative Essay Narrative WritingNarrative Essay Narrative Writing
Narrative Essay Narrative Writing
 
Postgraduate Essay Sample. How To Write A Masters Ess
Postgraduate Essay Sample. How To Write A Masters EssPostgraduate Essay Sample. How To Write A Masters Ess
Postgraduate Essay Sample. How To Write A Masters Ess
 
Sample Of Creative Self Introduction The Docume
Sample Of Creative Self Introduction The DocumeSample Of Creative Self Introduction The Docume
Sample Of Creative Self Introduction The Docume
 
Example Of Rogerian Argument Essay
Example Of Rogerian Argument EssayExample Of Rogerian Argument Essay
Example Of Rogerian Argument Essay
 
Who Can Help Me Write An Essay - HelpcoachS Diary
Who Can Help Me Write An Essay - HelpcoachS DiaryWho Can Help Me Write An Essay - HelpcoachS Diary
Who Can Help Me Write An Essay - HelpcoachS Diary
 
EOCD Big Data Flows vs. Wicked Leaks
EOCD Big Data Flows vs. Wicked LeaksEOCD Big Data Flows vs. Wicked Leaks
EOCD Big Data Flows vs. Wicked Leaks
 
Common Application Essay Examples
Common Application Essay ExamplesCommon Application Essay Examples
Common Application Essay Examples
 
Master Degree Sample Personal Statement Sampl
Master Degree Sample Personal Statement SamplMaster Degree Sample Personal Statement Sampl
Master Degree Sample Personal Statement Sampl
 
Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...
 
Getting comfortable with Data
Getting comfortable with DataGetting comfortable with Data
Getting comfortable with Data
 
Write An Expository Essay On How To M
Write An Expository Essay On How To MWrite An Expository Essay On How To M
Write An Expository Essay On How To M
 

More from Shree Shree

10 rectangular options.pptx
10 rectangular options.pptx10 rectangular options.pptx
10 rectangular options.pptx
Shree Shree
 
1732002B.pptx
1732002B.pptx1732002B.pptx
1732002B.pptx
Shree Shree
 
9D52BDFC.pptx
9D52BDFC.pptx9D52BDFC.pptx
9D52BDFC.pptx
Shree Shree
 
865FC50.pptx
865FC50.pptx865FC50.pptx
865FC50.pptx
Shree Shree
 
9DE46AD6.pptx
9DE46AD6.pptx9DE46AD6.pptx
9DE46AD6.pptx
Shree Shree
 
20DE6DE0.pptx
20DE6DE0.pptx20DE6DE0.pptx
20DE6DE0.pptx
Shree Shree
 
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
Shree Shree
 
2C0A56C5.pptx
2C0A56C5.pptx2C0A56C5.pptx
2C0A56C5.pptx
Shree Shree
 
1EF5E174.pptx
1EF5E174.pptx1EF5E174.pptx
1EF5E174.pptx
Shree Shree
 
1BB40E18.pptx
1BB40E18.pptx1BB40E18.pptx
1BB40E18.pptx
Shree Shree
 
8D0ECBB0.pptx
8D0ECBB0.pptx8D0ECBB0.pptx
8D0ECBB0.pptx
Shree Shree
 
D1586BA6.pptx
D1586BA6.pptxD1586BA6.pptx
D1586BA6.pptx
Shree Shree
 
FB09958.pptx
FB09958.pptxFB09958.pptx
FB09958.pptx
Shree Shree
 
13072EF.pptx
13072EF.pptx13072EF.pptx
13072EF.pptx
Shree Shree
 
E65B78F3.pptx
E65B78F3.pptxE65B78F3.pptx
E65B78F3.pptx
Shree Shree
 
355C162.pptx
355C162.pptx355C162.pptx
355C162.pptx
Shree Shree
 
795589F6.pptx
795589F6.pptx795589F6.pptx
795589F6.pptx
Shree Shree
 
AD177DF3.pptx
AD177DF3.pptxAD177DF3.pptx
AD177DF3.pptx
Shree Shree
 
961F1B3A.pptx
961F1B3A.pptx961F1B3A.pptx
961F1B3A.pptx
Shree Shree
 
5F2D461.pptx
5F2D461.pptx5F2D461.pptx
5F2D461.pptx
Shree Shree
 

More from Shree Shree (20)

10 rectangular options.pptx
10 rectangular options.pptx10 rectangular options.pptx
10 rectangular options.pptx
 
1732002B.pptx
1732002B.pptx1732002B.pptx
1732002B.pptx
 
9D52BDFC.pptx
9D52BDFC.pptx9D52BDFC.pptx
9D52BDFC.pptx
 
865FC50.pptx
865FC50.pptx865FC50.pptx
865FC50.pptx
 
9DE46AD6.pptx
9DE46AD6.pptx9DE46AD6.pptx
9DE46AD6.pptx
 
20DE6DE0.pptx
20DE6DE0.pptx20DE6DE0.pptx
20DE6DE0.pptx
 
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
 
2C0A56C5.pptx
2C0A56C5.pptx2C0A56C5.pptx
2C0A56C5.pptx
 
1EF5E174.pptx
1EF5E174.pptx1EF5E174.pptx
1EF5E174.pptx
 
1BB40E18.pptx
1BB40E18.pptx1BB40E18.pptx
1BB40E18.pptx
 
8D0ECBB0.pptx
8D0ECBB0.pptx8D0ECBB0.pptx
8D0ECBB0.pptx
 
D1586BA6.pptx
D1586BA6.pptxD1586BA6.pptx
D1586BA6.pptx
 
FB09958.pptx
FB09958.pptxFB09958.pptx
FB09958.pptx
 
13072EF.pptx
13072EF.pptx13072EF.pptx
13072EF.pptx
 
E65B78F3.pptx
E65B78F3.pptxE65B78F3.pptx
E65B78F3.pptx
 
355C162.pptx
355C162.pptx355C162.pptx
355C162.pptx
 
795589F6.pptx
795589F6.pptx795589F6.pptx
795589F6.pptx
 
AD177DF3.pptx
AD177DF3.pptxAD177DF3.pptx
AD177DF3.pptx
 
961F1B3A.pptx
961F1B3A.pptx961F1B3A.pptx
961F1B3A.pptx
 
5F2D461.pptx
5F2D461.pptx5F2D461.pptx
5F2D461.pptx
 

Recently uploaded

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
MateoGardella
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 

Recently uploaded (20)

Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 

10-Text-Mining.pptx

  • 1. Chapter 10 Text Mining Dr. Steffen Herbold herbold@cs.uni-goettingen.de Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 2. Outline • Overview • Challenges for Text Mining • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 3. Example for Textual Data Introduction to Data Science https://sherbold.github.io/intro-to-data-science Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for iPhone] Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! https://t.co/eQC2NqdIil [Twitter for iPhone] Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN rally. We need to elect @KarinHousley to the U.S. Senate, and we need the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and @PeteStauber in the U.S. House! [Twitter for iPhone] Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He helped pass tax reform which lowered taxes for EVERYONE! Nancy Pelosi is spending hundreds of thousands of dollars on his opponent because they both support a liberal agenda of higher taxes and wasteful spending! [Twitter for iPhone] Oct 4, 2018 02:29:27 PM “U.S. Stocks Widen Global Lead” https://t.co/Snhv08ulcO [Twitter for iPhone] Oct 4, 2018 02:17:28 PM Statement on National Strategy for Counterterrorism: https://t.co/ajFBg9Elsj https://t.co/Qr56ycjMAV [Twitter for iPhone] Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone] Oct 4, 2018 09:17:01 AM This is now the 7th. time the FBI has investigated Judge Kavanaugh. If we made it 100, it would still not be good enough for the Obstructionist Democrats. [Twitter for iPhone] Oct 4, 2018 09:01:13 AM RT @ChatByCC: While armed with the power of our Vote, we proudly & peacefully revolted Never doubt this was an American revolution #MAGA… [Twitter for iPhone] Oct 4, 2018 08:54:58 AM This is a very important time in our country. Due Process, Fairness and Common Sense are now on trial! [Twitter for iPhone] Oct 4, 2018 08:34:16 AM Our country’s great First Lady, Melania, is doing really well in Africa. The people love her, and she loves them! It is a beautiful thing to see. [Twitter for iPhone] Oct 4, 2018 07:16:41 AM The harsh and unfair treatment of Judge Brett Kavanaugh is having an incredible upward impact on voters. The PEOPLE get it far better than the politicians. Most importantly, this great life cannot be ruined by mean & despicable Democrats and totally uncorroborated allegations! [Twitter for iPhone] How do you analyze this?
  • 4. Text Mining in General • Inferring information from textual data • Techniques for text mining as diverse as the texts themselves! • The following demonstrates a relatively standard workflow for processing text data • Create corpus • Pre-process contents • Define bag-of-words Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 5. Documents and Corpus • Create documents to create a corpus Introduction to Data Science https://sherbold.github.io/intro-to-data-science Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for iPhone] Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! https://t.co/eQC2NqdIil [Twitter for iPhone] Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN rally. We need to elect @KarinHousley to the U.S. Senate, and we need the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and @PeteStauber in the U.S. House! [Twitter for iPhone] Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He helped pass tax reform which lowered taxes for EVERYONE! Nancy Pelosi is spending hundreds of thousands of dollars on his opponent because they both support a liberal agenda of higher taxes and wasteful spending! [Twitter for iPhone] Oct 4, 2018 02:29:27 PM “U.S. Stocks Widen Global Lead” https://t.co/Snhv08ulcO [Twitter for iPhone] Oct 4, 2018 02:17:28 PM Statement on National Strategy for Counterterrorism: https://t.co/ajFBg9Elsj https://t.co/Qr56ycjMAV [Twitter for iPhone] Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone] Oct 4, 2018 09:17:01 AM This is now the 7th. time the FBI has investigated Judge Kavanaugh. If we made it 100, it would still not be good enough for the Obstructionist Democrats. [Twitter for iPhone] Each tweet a document All tweets together the corpus
  • 6. Unstructured Data  Structured Data • Identify relevant content  Remove irrelevant content Introduction to Data Science https://sherbold.github.io/intro-to-data-science Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for iPhone] Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! https://t.co/eQC2NqdIil [Twitter for iPhone] Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN rally. We need to elect @KarinHousley to the U.S. Senate, and we need the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and @PeteStauber in the U.S. House! [Twitter for iPhone] Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He helped pass tax reform which lowered taxes for EVERYONE! Nancy Pelosi is spending hundreds of thousands of dollars on his opponent because they both support a liberal agenda of higher taxes and wasteful spending! [Twitter for iPhone] Oct 4, 2018 02:29:27 PM “U.S. Stocks Widen Global Lead” https://t.co/Snhv08ulcO [Twitter for iPhone] Oct 4, 2018 02:17:28 PM Statement on National Strategy for Counterterrorism: https://t.co/ajFBg9Elsj https://t.co/Qr56ycjMAV [Twitter for iPhone] Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone] Oct 4, 2018 09:17:01 AM This is now the 7th. time the FBI has investigated Judge Kavanaugh. If we made it 100, it would still not be good enough for the Obstructionist Democrats. [Twitter for iPhone] Drop shortened links Drop device Drop timestamp
  • 7. Punctuation and cases • Usually do not carry information  Can be removed Introduction to Data Science https://sherbold.github.io/intro-to-data-science beautiful evening in rochester minnesota vote vote vote thank you minnesota i love you just made my second stop in minnesota for a make america great again rally we need to elect karinhousley to the us senate and we need the strong leadership of tomemmer jason2cd jimhagedornmn and petestauber in the us house congressman bishop is doing a great job he helped pass tax reform which lowered taxes for everyone nancy pelosi is spending hundreds of thousands of dollars on his opponent because they both support a liberal agenda of higher taxes and wasteful spending us stocks widen global lead statement on national strategy for counterterrorism working hard thank you this is now the 7th time the fbi has investigated judge kavanaugh if we made it 100 it would still not be good enough for the obstructionist democrats
  • 8. Stop words • Most common words in a language (a, the, I, we, to, too, …)  Can be removed Introduction to Data Science https://sherbold.github.io/intro-to-data-science beautiful evening rochester minnesota vote vote vote thank minnesota love just made second stop minnesota make america great again rally need elect karinhousley senate need strong leadership tomemmer jason2cd jimhagedornmn petestauber house congressman bishop doing great job helped pass tax reform lowered taxes everyone nancy pelosi spending hundreds thousands dollars opponent both support liberal agenda higher taxes wasteful spending stocks widen global lead statement national strategy counterterrorism working hard thank now 7th time fbi investigated judge kavanaugh made 100 would still good enough obstructionist democrats
  • 9. Stemming and Lemmatization • Stemming: reduce terms to common stem (spending  spend) • Lemmatization: use common term for synonyms (better  good) Introduction to Data Science https://sherbold.github.io/intro-to-data-science beautiful evening rochester minnesota vote vote vote thank minnesota love just make second stop minnesota make america great again rally need elect karinhousley senate need strong leadership tomemmer jason2cd jimhagedornmn petestauber house congressman bishop do good job help pass tax reform lower tax everyone nancy pelosi spend hundred thousand dollar opponent both support liberal agenda high tax waste spend stock wide global lead statement national strategy counterterrorism work hard thank now 7th time fbi investigate judge kavanaugh make 100 would still good enough obstruct democrat
  • 10. Bag-of-Words beauti- ful evenin g roches -ter minne- sota vote thank love just make second stop amer- ica great … 1 1 1 1 3 0 0 0 0 0 0 0 0 … 0 0 0 1 0 1 1 0 0 0 0 0 0 … 0 0 0 1 0 0 0 0 2 1 1 1 1 … 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 1 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 1 0 0 0 0 … … … … … … … … … … … … … … … Introduction to Data Science https://sherbold.github.io/intro-to-data-science Every term one dimension Number of occurences in document
  • 11. Inverse Document Frequency • Absolute frequency of words problematic for discrimination • Favors words that occur often and in many documents • Similar to stop words • Inverse document frequency for the uniqueness of terms • 𝑖𝑑𝑓𝑖 = log 𝑁 𝐷𝑖 with 𝐷𝑖 the number of documents in which 𝑖 appears • Inverse document frequency to weight terms by uniqueness • 𝑡𝑓𝑖𝑑𝑓𝑖 = 𝑡𝑓𝑖 ⋅ 𝑖𝑑𝑓𝑖 Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 12. Downstream Analysis • Bag of words base structure • Allows many approaches for analysis • Classification • Usually requires manual labeling of data • Clustering • Sentiment analysis • Based on word counts • Information retrieval • Identification of related documents • Visualizations Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 13. Outline • Overview • Challenges for Text Mining • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 14. High Dimensional Data • Bag of words gets high dimensional extremely fast • Still over sixty different words after stemming/lemmatization in the tweet example • Can also have a huge amount of documents • >39000 tweets by donald trump in total • High dimension+many documents  very high runtime • Requires • Very efficient algorithms • Often only possible through massive parallelization • Naive Bayes is a popular choice Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 15. Ambiguities • Homonyms • break (take a break, break something) • Will all be grouped together by a bag of words • Semantic changes in interpretation of the same sentences • I hit the man with a stick. (Used a stick to hit the man) • I hit the man with a stick (I hit the man who was holding a stick) • Can often only be inferred from a greater context  Often impossible to resolve and leads to noise in the analysis Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 16. Capturing Syntax • Bag of words ignores syntax • Nine sentences with different meanings and the same words 1. Only he told his mistress that he loved her. (Nobody else did) 2. He only told his mistress that he loved her. (He didn't show her) 3. He told only his mistress that he loved her. (Kept it a secret from everyone else) 4. He told his only mistress that he loved her. (Stresses that he had only ONE!) 5. He told his mistress only that he loved her. (Didn't tell her anything else) 6. He told his mistress that only he loved her. ("I'm all you got, nobody else wants you.") 7. He told his mistress that he only loved her. (Not that he wanted to marry her.) 8. He told his mistress that he loved only her. (Yeah, don't they all...). 9. He told his mistress that he loved her only. (Similar to above one). • Capturing word orders and other relationships greatly increases the dimension • E.g., n-grams Introduction to Data Science https://sherbold.github.io/intro-to-data-science (bag of words over n-tuples)
  • 17. And the list goes on • Bad spelling • Slang • Synonyms not captured by lemmatization • Encodings and special characters • … Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 18. Outline • Overview • Challenges for Text Mining • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 19. Summary • Text mining is the analysis of textual data • Requires imposing a structure upon the unstructured texts • Bag of words • There are some standard techniques • Removing punctuation, cases, stopwords • Stemming/Lemmatization • Often also removing numbers and special characters  Depends on context! • Many problems due to noise in the textual data Introduction to Data Science https://sherbold.github.io/intro-to-data-science