Chapter 10
Text Mining
Dr. Steffen Herbold
herbold@cs.uni-goettingen.de
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
• Overview
• Challenges for Text Mining
• Summary
Example for Textual Data
Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for
iPhone]
Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! https://t.co/eQC2NqdIil [Twitter for iPhone]
Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN rally. We need to elect
@KarinHousley to the U.S. Senate, and we need the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and
@PeteStauber in the U.S. House! [Twitter for iPhone]
Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He helped pass tax reform which lowered taxes for
EVERYONE! Nancy Pelosi is spending hundreds of thousands of dollars on his opponent because they both support a liberal
agenda of higher taxes and wasteful spending! [Twitter for iPhone]
Oct 4, 2018 02:29:27 PM “U.S. Stocks Widen Global Lead” https://t.co/Snhv08ulcO [Twitter for iPhone]
Oct 4, 2018 02:17:28 PM Statement on National Strategy for Counterterrorism: https://t.co/ajFBg9Elsj https://t.co/Qr56ycjMAV
[Twitter for iPhone]
Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone]
Oct 4, 2018 09:17:01 AM This is now the 7th. time the FBI has investigated Judge Kavanaugh. If we made it 100, it would still not
be good enough for the Obstructionist Democrats. [Twitter for iPhone]
Oct 4, 2018 09:01:13 AM RT @ChatByCC: While armed with the power of our Vote, we proudly & peacefully revolted Never
doubt this was an American revolution #MAGA… [Twitter for iPhone]
Oct 4, 2018 08:54:58 AM This is a very important time in our country. Due Process, Fairness and Common Sense are now on
trial! [Twitter for iPhone]
Oct 4, 2018 08:34:16 AM Our country’s great First Lady, Melania, is doing really well in Africa. The people love her, and she
loves them! It is a beautiful thing to see. [Twitter for iPhone]
Oct 4, 2018 07:16:41 AM The harsh and unfair treatment of Judge Brett Kavanaugh is having an incredible upward impact on
voters. The PEOPLE get it far better than the politicians. Most importantly, this great life cannot be ruined by mean & despicable
Democrats and totally uncorroborated allegations! [Twitter for iPhone]
How do you analyze this?
Text Mining in General
• Inferring information from textual data
• Techniques for text mining are as diverse as the texts themselves!
• The following demonstrates a relatively standard workflow for processing text data
• Create corpus
• Pre-process contents
• Define bag-of-words
Documents and Corpus
• Split the data into documents; together, the documents form the corpus
Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for iPhone]
Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! https://t.co/eQC2NqdIil [Twitter for iPhone]
Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN rally. We need to elect @KarinHousley to the U.S. Senate, and
we need the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and @PeteStauber in the U.S. House! [Twitter for iPhone]
Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He helped pass tax reform which lowered taxes for EVERYONE! Nancy Pelosi is spending hundreds
of thousands of dollars on his opponent because they both support a liberal agenda of higher taxes and wasteful spending! [Twitter for iPhone]
Oct 4, 2018 02:29:27 PM “U.S. Stocks Widen Global Lead” https://t.co/Snhv08ulcO [Twitter for iPhone]
Oct 4, 2018 02:17:28 PM Statement on National Strategy for Counterterrorism: https://t.co/ajFBg9Elsj https://t.co/Qr56ycjMAV [Twitter for iPhone]
Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone]
Oct 4, 2018 09:17:01 AM This is now the 7th. time the FBI has investigated Judge Kavanaugh. If we made it 100, it would still not be good enough for the Obstructionist
Democrats. [Twitter for iPhone]
Each tweet is a document
All tweets together form the corpus
Unstructured Data → Structured Data
• Identify relevant content
→ Remove irrelevant content
Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for iPhone]
Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! https://t.co/eQC2NqdIil [Twitter for iPhone]
Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN rally. We need to elect @KarinHousley to the U.S. Senate, and
we need the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and @PeteStauber in the U.S. House! [Twitter for iPhone]
Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He helped pass tax reform which lowered taxes for EVERYONE! Nancy Pelosi is spending hundreds
of thousands of dollars on his opponent because they both support a liberal agenda of higher taxes and wasteful spending! [Twitter for iPhone]
Oct 4, 2018 02:29:27 PM “U.S. Stocks Widen Global Lead” https://t.co/Snhv08ulcO [Twitter for iPhone]
Oct 4, 2018 02:17:28 PM Statement on National Strategy for Counterterrorism: https://t.co/ajFBg9Elsj https://t.co/Qr56ycjMAV [Twitter for iPhone]
Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone]
Oct 4, 2018 09:17:01 AM This is now the 7th. time the FBI has investigated Judge Kavanaugh. If we made it 100, it would still not be good enough for the Obstructionist
Democrats. [Twitter for iPhone]
Drop shortened links
Drop device
Drop timestamp
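The three clean-up steps can be sketched with regular expressions. This is a minimal sketch, not the lecture's own code: the patterns assume every line follows the layout of the example tweets (timestamp, text, device in square brackets), and the names `TIMESTAMP`, `DEVICE`, `SHORT_LINK` and `clean_tweet` are hypothetical.

```python
import re

# Assumed line layout, taken from the example tweets:
#   "<Mon D, YYYY HH:MM:SS AM|PM> <text> [<device>]"
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} \d{1,2}, \d{4} \d{2}:\d{2}:\d{2} [AP]M\s+")
DEVICE = re.compile(r"\s*\[[^\]]*\]\s*$")
SHORT_LINK = re.compile(r"https://t\.co/\w+")


def clean_tweet(line):
    """Drop timestamp, device tag, and shortened links from one raw tweet."""
    text = TIMESTAMP.sub("", line)
    text = DEVICE.sub("", text)
    text = SHORT_LINK.sub("", text)
    return " ".join(text.split())  # collapse leftover whitespace


raw = ("Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! "
       "https://t.co/eQC2NqdIil [Twitter for iPhone]")
print(clean_tweet(raw))  # Thank you Minnesota - I love you!
```

Data from other sources would likely need different patterns; which parts count as irrelevant depends on the analysis.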
Punctuation and cases
• Usually do not carry information
→ Can be removed
beautiful evening in rochester minnesota vote vote vote
thank you minnesota i love you
just made my second stop in minnesota for a make america great again rally we need to elect karinhousley to the us senate and we need the strong leadership of
tomemmer jason2cd jimhagedornmn and petestauber in the us house
congressman bishop is doing a great job he helped pass tax reform which lowered taxes for everyone nancy pelosi is spending hundreds of thousands of dollars on his
opponent because they both support a liberal agenda of higher taxes and wasteful spending
us stocks widen global lead
statement on national strategy for counterterrorism
working hard thank you
this is now the 7th time the fbi has investigated judge kavanaugh if we made it 100 it would still not be good enough for the obstructionist democrats
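A minimal sketch of this step, assuming that every character that is neither a word character nor whitespace counts as punctuation and is deleted rather than replaced by a space (this turns "U.S." into "us", as in the example above). The function name `normalize` is hypothetical.

```python
import re


def normalize(text):
    """Lower-case the text and delete every punctuation character."""
    text = text.lower()
    # Keep only word characters (letters, digits, underscore) and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())  # collapse repeated whitespace


print(normalize("Thank you Minnesota - I love you!"))  # thank you minnesota i love you
print(normalize("“U.S. Stocks Widen Global Lead”"))    # us stocks widen global lead
```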
Stop words
• Most common words in a language (a, the, I, we, to, too, …)
→ Can be removed
beautiful evening rochester minnesota vote vote vote
thank minnesota love
just made second stop minnesota make america great again rally need elect karinhousley senate need strong leadership tomemmer jason2cd jimhagedornmn petestauber
house
congressman bishop doing great job helped pass tax reform lowered taxes everyone nancy pelosi spending hundreds thousands dollars opponent both support liberal
agenda higher taxes wasteful spending
stocks widen global lead
statement national strategy counterterrorism
working hard thank
now 7th time fbi investigated judge kavanaugh made 100 would still good enough obstructionist democrats
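Stop word removal is a simple filter over the tokens. The sketch below uses a tiny hand-written stop word list, which is an assumption for illustration only; practical lists are much longer and usually come from a text-processing library.

```python
# Tiny illustrative stop word list (an assumption, not a complete list).
STOP_WORDS = {"a", "and", "for", "i", "in", "my", "of", "the", "to", "us", "we", "you"}


def remove_stop_words(text):
    """Split on whitespace and keep only tokens that are not stop words."""
    return [token for token in text.split() if token not in STOP_WORDS]


print(remove_stop_words("thank you minnesota i love you"))
# ['thank', 'minnesota', 'love']
```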
Stemming and Lemmatization
• Stemming: reduce terms to common stem (spending → spend)
• Lemmatization: reduce terms to their dictionary form, the lemma (better → good)
beautiful evening rochester minnesota vote vote vote
thank minnesota love
just make second stop minnesota make america great again rally need elect karinhousley senate need strong leadership tomemmer jason2cd jimhagedornmn petestauber
house
congressman bishop do good job help pass tax reform lower tax everyone nancy pelosi spend hundred thousand dollar opponent both support liberal agenda high tax
waste spend
stock wide global lead
statement national strategy counterterrorism
work hard thank
now 7th time fbi investigate judge kavanaugh make 100 would still good enough obstruct democrat
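The difference between the two techniques can be illustrated with a toy example: stemming as rule-based suffix stripping, lemmatization as a dictionary lookup. The suffix rules and the lemma dictionary below are assumptions made up for this sketch; they are far cruder than the stemmers and lemmatizers found in real libraries.

```python
# Toy lemma dictionary for irregular forms (illustrative assumption).
LEMMAS = {"better": "good", "made": "make"}
# Toy suffix rules, tried in this order (illustrative assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def normalize_token(token):
    """Look the token up as a lemma first, otherwise strip one known suffix."""
    if token in LEMMAS:
        return LEMMAS[token]
    if token.endswith("ss"):  # do not turn "pass" into "pas"
        return token
    for suffix in SUFFIXES:
        stem = token[: -len(suffix)]
        if token.endswith(suffix) and len(stem) >= 3:
            return stem
    return token


print([normalize_token(t) for t in ["spending", "helped", "taxes", "better", "pass"]])
# ['spend', 'help', 'tax', 'good', 'pass']
```

Rules this simple also damage words they should leave alone: this sketch turns "evening" into "even".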
Bag-of-Words
beautiful  evening  rochester  minnesota  vote  thank  love  just  make  second  stop  america  great  …
1          1        1          1          3     0      0     0     0     0       0     0        0      …
0          0        0          1          0     1      1     0     0     0       0     0        0      …
0          0        0          1          0     0      0     0     2     1       1     1        1      …
0          0        0          0          0     0      0     0     0     0       0     0        0      …
0          0        0          0          0     0      0     0     0     0       0     0        0      …
0          0        0          0          0     1      0     0     0     0       0     0        0      …
0          0        0          0          0     0      0     0     1     0       0     0        0      …
…          …        …          …          …     …      …     …     …     …       …     …        …      …
Every term is one dimension (column)
Each entry is the number of occurrences in a document (row)
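A bag-of-words matrix can be built with a counter per document. The sketch below uses the first two pre-processed example tweets; ordering the vocabulary by first appearance is an arbitrary choice made here, and the names `vocabulary` and `to_vector` are hypothetical.

```python
from collections import Counter

docs = [
    "beautiful evening rochester minnesota vote vote vote",
    "thank minnesota love",
]

# One dimension per distinct term, ordered by first appearance in the corpus.
vocabulary = list(dict.fromkeys(token for doc in docs for token in doc.split()))


def to_vector(doc):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]


matrix = [to_vector(doc) for doc in docs]
print(vocabulary)
# ['beautiful', 'evening', 'rochester', 'minnesota', 'vote', 'thank', 'love']
print(matrix)
# [[1, 1, 1, 1, 3, 0, 0], [0, 0, 0, 1, 0, 1, 1]]
```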
Inverse Document Frequency
• Absolute word frequencies are problematic for discriminating between documents
• Favors words that occur often and in many documents
• Similar to stop words
• Inverse document frequency for the uniqueness of terms
• $idf_i = \log \frac{N}{D_i}$ with $N$ the total number of documents and $D_i$ the number of documents in which term $i$ appears
• Inverse document frequency to weight terms by uniqueness
• $tfidf_i = tf_i \cdot idf_i$ with $tf_i$ the number of occurrences of term $i$ in a document
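The two formulas can be computed directly from token lists. This is a minimal sketch that assumes the natural logarithm (the base is not fixed above) and uses three of the pre-processed example tweets; the helper name `tfidf` is hypothetical.

```python
import math
from collections import Counter

docs = [
    ["beautiful", "evening", "rochester", "minnesota", "vote", "vote", "vote"],
    ["thank", "minnesota", "love"],
    ["work", "hard", "thank"],
]

N = len(docs)
# D_i: number of documents in which term i appears (each document counts once).
document_frequency = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Weight the term counts of one document by the inverse document frequency."""
    tf = Counter(doc)
    return {term: count * math.log(N / document_frequency[term])
            for term, count in tf.items()}


weights = tfidf(docs[0])
print(round(weights["vote"], 3))       # 3.296  (3 * log(3/1))
print(round(weights["minnesota"], 3))  # 0.405  (1 * log(3/2))
```

"minnesota" appears in two of the three documents and therefore gets a lower weight than "beautiful", which appears in only one.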
Downstream Analysis
• Bag of words is the base structure
• Allows many approaches for analysis
• Classification
• Usually requires manual labeling of data
• Clustering
• Sentiment analysis
• Based on word counts
• Information retrieval
• Identification of related documents
• Visualizations
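As one illustration of identifying related documents, two bags of words can be compared with the cosine similarity of their count vectors. Cosine similarity is only one common choice and is an assumption here, not something prescribed above; the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the word-count vectors of two documents."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


print(round(cosine_similarity("thank minnesota love", "work hard thank"), 3))  # 0.333
print(cosine_similarity("thank minnesota love", "stock wide global lead"))     # 0.0
```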
High Dimensional Data
• Bag of words gets high dimensional extremely fast
• Still over sixty different words after stemming/lemmatization in the tweet example
• Can also have a huge number of documents
• >39000 tweets by Donald Trump in total
• High dimension + many documents → very high runtime
• Requires
• Very efficient algorithms
• Often only possible through massive parallelization
• Naive Bayes is a popular choice
Ambiguities
• Homonyms
• break (take a break, break something)
• Will all be grouped together by a bag of words
• The same sentence can have different interpretations
• I hit the man with a stick. (Used a stick to hit the man)
• I hit the man with a stick. (I hit the man who was holding a stick)
• Can often only be inferred from a greater context
→ Often impossible to resolve and leads to noise in the analysis
Capturing Syntax
• Bag of words ignores syntax
• Nine sentences with different meanings and the same words
1. Only he told his mistress that he loved her. (Nobody else did)
2. He only told his mistress that he loved her. (He didn't show her)
3. He told only his mistress that he loved her. (Kept it a secret from everyone else)
4. He told his only mistress that he loved her. (Stresses that he had only ONE!)
5. He told his mistress only that he loved her. (Didn't tell her anything else)
6. He told his mistress that only he loved her. ("I'm all you got, nobody else wants you.")
7. He told his mistress that he only loved her. (Not that he wanted to marry her.)
8. He told his mistress that he loved only her. (Yeah, don't they all...).
9. He told his mistress that he loved her only. (Similar to above one).
• Capturing word order and other relationships greatly increases the dimension
• E.g., n-grams (bag of words over n-tuples)
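A small sketch of the n-gram idea, using sentences 1 and 3 from the list above with punctuation removed and lower-cased. The helper name `ngrams` is hypothetical. Over single words both sentences have the same bag; over 2-tuples (bigrams) they differ.

```python
from collections import Counter


def ngrams(text, n):
    """Bag of words over n-tuples of consecutive tokens."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


s1 = "only he told his mistress that he loved her"
s3 = "he told only his mistress that he loved her"

print(ngrams(s1, 1) == ngrams(s3, 1))  # True: identical bag of words
print(ngrams(s1, 2) == ngrams(s3, 2))  # False: the bigrams capture word order
```

The price is dimensionality: every distinct n-tuple becomes its own dimension, so the number of dimensions tends to grow quickly with n on larger corpora.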
And the list goes on
• Bad spelling
• Slang
• Synonyms not captured by lemmatization
• Encodings and special characters
• …
Summary
• Text mining is the analysis of textual data
• Requires imposing a structure upon the unstructured texts
• Bag of words
• There are some standard techniques
• Removing punctuation, case differences, and stop words
• Stemming/Lemmatization
• Often also removing numbers and special characters
→ Depends on context!
• Many problems due to noise in the textual data
10-Text-Mining.pptx

  • 1. Chapter 10 Text Mining Dr. Steffen Herbold herbold@cs.uni-goettingen.de Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 2. Outline • Overview • Challenges for Text Mining • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 3. Example for Textual Data
    Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for iPhone]
    Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! https://t.co/eQC2NqdIil [Twitter for iPhone]
    Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN rally. We need to elect @KarinHousley to the U.S. Senate, and we need the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and @PeteStauber in the U.S. House! [Twitter for iPhone]
    Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He helped pass tax reform which lowered taxes for EVERYONE! Nancy Pelosi is spending hundreds of thousands of dollars on his opponent because they both support a liberal agenda of higher taxes and wasteful spending! [Twitter for iPhone]
    Oct 4, 2018 02:29:27 PM “U.S. Stocks Widen Global Lead” https://t.co/Snhv08ulcO [Twitter for iPhone]
    Oct 4, 2018 02:17:28 PM Statement on National Strategy for Counterterrorism: https://t.co/ajFBg9Elsj https://t.co/Qr56ycjMAV [Twitter for iPhone]
    Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone]
    Oct 4, 2018 09:17:01 AM This is now the 7th. time the FBI has investigated Judge Kavanaugh. If we made it 100, it would still not be good enough for the Obstructionist Democrats. [Twitter for iPhone]
    Oct 4, 2018 09:01:13 AM RT @ChatByCC: While armed with the power of our Vote, we proudly & peacefully revolted Never doubt this was an American revolution #MAGA… [Twitter for iPhone]
    Oct 4, 2018 08:54:58 AM This is a very important time in our country. Due Process, Fairness and Common Sense are now on trial! [Twitter for iPhone]
    Oct 4, 2018 08:34:16 AM Our country’s great First Lady, Melania, is doing really well in Africa. The people love her, and she loves them! It is a beautiful thing to see. [Twitter for iPhone]
    Oct 4, 2018 07:16:41 AM The harsh and unfair treatment of Judge Brett Kavanaugh is having an incredible upward impact on voters. The PEOPLE get it far better than the politicians. Most importantly, this great life cannot be ruined by mean & despicable Democrats and totally uncorroborated allegations! [Twitter for iPhone]
    How do you analyze this?
  • 4. Text Mining in General • Inferring information from textual data • Techniques for text mining as diverse as the texts themselves! • The following demonstrates a relatively standard workflow for processing text data • Create corpus • Pre-process contents • Define bag-of-words
  • 5. Documents and Corpus • Create documents to create a corpus • Each tweet is a document • All tweets together form the corpus
    Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for iPhone]
    Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! https://t.co/eQC2NqdIil [Twitter for iPhone]
    Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN rally. We need to elect @KarinHousley to the U.S. Senate, and we need the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and @PeteStauber in the U.S. House! [Twitter for iPhone]
    Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He helped pass tax reform which lowered taxes for EVERYONE! Nancy Pelosi is spending hundreds of thousands of dollars on his opponent because they both support a liberal agenda of higher taxes and wasteful spending! [Twitter for iPhone]
    Oct 4, 2018 02:29:27 PM “U.S. Stocks Widen Global Lead” https://t.co/Snhv08ulcO [Twitter for iPhone]
    Oct 4, 2018 02:17:28 PM Statement on National Strategy for Counterterrorism: https://t.co/ajFBg9Elsj https://t.co/Qr56ycjMAV [Twitter for iPhone]
    Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone]
    Oct 4, 2018 09:17:01 AM This is now the 7th. time the FBI has investigated Judge Kavanaugh. If we made it 100, it would still not be good enough for the Obstructionist Democrats. [Twitter for iPhone]
  • 6. Unstructured Data → Structured Data • Identify relevant content → Remove irrelevant content • Drop shortened links • Drop device • Drop timestamp
    Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for iPhone]
    Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! https://t.co/eQC2NqdIil [Twitter for iPhone]
    Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN rally. We need to elect @KarinHousley to the U.S. Senate, and we need the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and @PeteStauber in the U.S. House! [Twitter for iPhone]
    Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He helped pass tax reform which lowered taxes for EVERYONE! Nancy Pelosi is spending hundreds of thousands of dollars on his opponent because they both support a liberal agenda of higher taxes and wasteful spending! [Twitter for iPhone]
    Oct 4, 2018 02:29:27 PM “U.S. Stocks Widen Global Lead” https://t.co/Snhv08ulcO [Twitter for iPhone]
    Oct 4, 2018 02:17:28 PM Statement on National Strategy for Counterterrorism: https://t.co/ajFBg9Elsj https://t.co/Qr56ycjMAV [Twitter for iPhone]
    Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone]
    Oct 4, 2018 09:17:01 AM This is now the 7th. time the FBI has investigated Judge Kavanaugh. If we made it 100, it would still not be good enough for the Obstructionist Democrats. [Twitter for iPhone]
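The three "drop" steps on this slide can be sketched with regular expressions. This is only a minimal sketch: the patterns and the function name `strip_metadata` are assumptions fitted to the tweet format shown above, not part of the lecture material.

```python
import re

# Assumed patterns, derived only from the tweet layout shown on the slide.
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} \d{1,2}, \d{4} \d{2}:\d{2}:\d{2} [AP]M\s*")
DEVICE = re.compile(r"\s*\[Twitter for [^\]]+\]\s*$")
SHORT_LINK = re.compile(r"https://t\.co/\w+")


def strip_metadata(raw_tweet: str) -> str:
    """Drop the timestamp, the device tag, and shortened links from one tweet."""
    text = TIMESTAMP.sub("", raw_tweet)
    text = DEVICE.sub("", text)
    text = SHORT_LINK.sub("", text)
    # Collapse the whitespace that the removed parts leave behind.
    return " ".join(text.split())


raw = ("Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! "
       "https://t.co/eQC2NqdIil [Twitter for iPhone]")
print(strip_metadata(raw))  # Thank you Minnesota - I love you!
```

Tweets in another export format would need different patterns.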
  • 7. Punctuation and case • Usually do not carry information → Can be removed
    beautiful evening in rochester minnesota vote vote vote
    thank you minnesota i love you
    just made my second stop in minnesota for a make america great again rally we need to elect karinhousley to the us senate and we need the strong leadership of tomemmer jason2cd jimhagedornmn and petestauber in the us house
    congressman bishop is doing a great job he helped pass tax reform which lowered taxes for everyone nancy pelosi is spending hundreds of thousands of dollars on his opponent because they both support a liberal agenda of higher taxes and wasteful spending
    us stocks widen global lead
    statement on national strategy for counterterrorism
    working hard thank you
    this is now the 7th time the fbi has investigated judge kavanaugh if we made it 100 it would still not be good enough for the obstructionist democrats
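Removing punctuation and case can be sketched with the standard library alone. The function name `normalize` is hypothetical, and the sketch assumes that only ASCII punctuation has to go; typographic quotes such as “ ” would need extra handling.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase the text, delete ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(PUNCTUATION_TABLE).split())


print(normalize("Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE!"))
# beautiful evening in rochester minnesota vote vote vote
```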
  • 8. Stop words • Most common words in a language (a, the, I, we, to, too, …) → Can be removed
    beautiful evening rochester minnesota vote vote vote
    thank minnesota love
    just made second stop minnesota make america great again rally need elect karinhousley senate need strong leadership tomemmer jason2cd jimhagedornmn petestauber house
    congressman bishop doing great job helped pass tax reform lowered taxes everyone nancy pelosi spending hundreds thousands dollars opponent both support liberal agenda higher taxes wasteful spending
    stocks widen global lead
    statement national strategy counterterrorism
    working hard thank
    now 7th time fbi investigated judge kavanaugh made 100 would still good enough obstructionist democrats
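Stop word removal amounts to filtering tokens against a word list. The list below is a tiny hypothetical one for illustration; practical stop word lists are much longer and language-specific.

```python
# Hypothetical, deliberately short stop word list (an assumption for this sketch).
STOP_WORDS = {"a", "the", "i", "we", "to", "too", "in", "you"}


def remove_stop_words(text: str) -> list[str]:
    """Split on whitespace and keep only tokens that are not stop words."""
    return [token for token in text.split() if token not in STOP_WORDS]


print(remove_stop_words("thank you minnesota i love you"))
# ['thank', 'minnesota', 'love']
```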
  • 9. Stemming and Lemmatization • Stemming: reduce terms to a common stem (spending → spend) • Lemmatization: reduce terms to their dictionary form, the lemma (better → good)
    beautiful evening rochester minnesota vote vote vote
    thank minnesota love
    just make second stop minnesota make america great again rally need elect karinhousley senate need strong leadership tomemmer jason2cd jimhagedornmn petestauber house
    congressman bishop do good job help pass tax reform lower tax everyone nancy pelosi spend hundred thousand dollar opponent both support liberal agenda high tax waste spend
    stock wide global lead
    statement national strategy counterterrorism
    work hard thank
    now 7th time fbi investigate judge kavanaugh make 100 would still good enough obstruct democrat
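The difference between the two ideas can be shown with a toy sketch: stemming strips suffixes by rule, while lemmatization needs a dictionary lookup. Both the suffix list and the lookup table are hypothetical and far cruder than real tools (for example the Porter stemmer or a WordNet-based lemmatizer as provided by NLTK), so the output will not always match the slide.

```python
# Toy lookup table for irregular forms (an assumption for this sketch).
LEMMAS = {"better": "good", "made": "make", "doing": "do"}

# Toy suffix rules, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(token: str) -> str:
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize_token(token: str) -> str:
    """Prefer the dictionary lookup and fall back to rule-based stemming."""
    return LEMMAS.get(token, stem(token))


print([normalize_token(t) for t in ["spending", "helped", "taxes", "better"]])
# ['spend', 'help', 'tax', 'good']
```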
  • 10. Bag-of-Words • Every term is one dimension • Each entry is the number of occurrences in the document
    beautiful  evening  rochester  minnesota  vote  thank  love  just  make  second  stop  america  great  …
    1          1        1          1          3     0      0     0     0     0       0     0        0      …
    0          0        0          1          0     1      1     0     0     0       0     0        0      …
    0          0        0          1          0     0      0     0     2     1       1     1        1      …
    0          0        0          0          0     0      0     0     0     0       0     0        0      …
    0          0        0          0          0     0      0     0     0     0       0     0        0      …
    0          0        0          0          0     1      0     0     0     0       0     0        0      …
    0          0        0          0          0     0      0     0     1     0       0     0        0      …
    …          …        …          …          …     …      …     …     …     …       …     …        …      …
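A bag-of-words matrix like the one above can be sketched with `collections.Counter`. The function name and the alphabetical column order are choices made for this sketch; the slide orders terms by first appearance instead.

```python
from collections import Counter


def bag_of_words(documents: list[list[str]]) -> tuple[list[str], list[list[int]]]:
    """Return the vocabulary and one row of term counts per document."""
    vocabulary = sorted({token for doc in documents for token in doc})
    rows = []
    for doc in documents:
        counts = Counter(doc)
        # A Counter returns 0 for terms that do not occur in the document.
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


docs = [
    "beautiful evening rochester minnesota vote vote vote".split(),
    "thank minnesota love".split(),
]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```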
  • 11. Inverse Document Frequency • Absolute frequency of words is problematic for discrimination • Favors words that occur often and in many documents • Similar to stop words • Inverse document frequency measures the uniqueness of terms • $\mathit{idf}_i = \log \frac{N}{D_i}$ with $N$ the total number of documents and $D_i$ the number of documents in which term $i$ appears • Inverse document frequency is used to weight terms by uniqueness • $\mathit{tfidf}_i = \mathit{tf}_i \cdot \mathit{idf}_i$ with $\mathit{tf}_i$ the number of occurrences of term $i$ in a document
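The two formulas can be sketched directly. The slide does not state the base of the logarithm, so the natural logarithm used here is an assumption, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term count tf_i by idf_i = log(N / D_i)."""
    n_docs = len(documents)
    # D_i: number of documents in which term i appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in documents
    ]


docs = [
    "beautiful evening rochester minnesota vote vote vote".split(),
    "thank minnesota love".split(),
    "work hard thank".split(),
]
weights = tf_idf(docs)
# "minnesota" occurs in two of three documents, so it is weighted lower
# than "beautiful", which occurs in only one.
print(weights[0]["minnesota"] < weights[0]["beautiful"])
```

A term that appears in every document gets the weight log(1) = 0, which is the sense in which frequent terms behave like stop words.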
  • 12. Downstream Analysis • Bag of words is the base structure • Allows many approaches for analysis • Classification • Usually requires manual labeling of data • Clustering • Sentiment analysis • Based on word counts • Information retrieval • Identification of related documents • Visualizations
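As one illustration of identifying related documents, rows of the bag-of-words matrix can be compared with cosine similarity. The slides do not name a similarity measure, so this choice is an assumption; the vectors below are the first three rows of the bag-of-words table, truncated to the thirteen columns shown there.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two term-count vectors (0.0 if one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


tweet_1 = [1, 1, 1, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0]
tweet_2 = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
tweet_3 = [0, 0, 0, 1, 0, 0, 0, 0, 2, 1, 1, 1, 1]

# The tweets only share the term "minnesota", so the similarities are small.
print(cosine_similarity(tweet_1, tweet_2))
print(cosine_similarity(tweet_2, tweet_3))
```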
  • 13. Outline • Overview • Challenges for Text Mining • Summary
  • 14. High Dimensional Data • Bag of words gets high dimensional extremely fast • Still over sixty different words after stemming/lemmatization in the tweet example • Can also have a huge number of documents • >39,000 tweets by Donald Trump in total • High dimension + many documents → very high runtime • Requires • Very efficient algorithms • Often only possible through massive parallelization • Naive Bayes is a popular choice
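One reason Naive Bayes copes with this setting is that training only needs one pass that counts terms per class. Below is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing; the variant, the function names, and the toy labels "campaign" and "economy" are assumptions for illustration, not taken from the slides.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Count documents per class and term occurrences per class."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_docs, term_counts, vocabulary


def predict(model, tokens):
    """Return the class with the highest log-probability for the tokens."""
    class_docs, term_counts, vocabulary = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_docs.items():
        total = sum(term_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior of the class
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore terms never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((term_counts[label][token] + 1) / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labeled training data.
train_docs = [
    "vote rally minnesota".split(),
    "vote elect senate".split(),
    "tax reform stock".split(),
    "stock global lead".split(),
]
train_labels = ["campaign", "campaign", "economy", "economy"]
model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "vote senate rally".split()))
print(predict(model, "tax stock".split()))
```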
  • 15. Ambiguities • Homonyms • break (take a break, break something) • Will all be grouped together by a bag of words • Different interpretations of the same sentence • I hit the man with a stick. (Used a stick to hit the man) • I hit the man with a stick. (I hit the man who was holding a stick) • Can often only be inferred from a greater context → Often impossible to resolve and leads to noise in the analysis
  • 16. Capturing Syntax • Bag of words ignores syntax • Nine sentences with different meanings and the same words
    1. Only he told his mistress that he loved her. (Nobody else did)
    2. He only told his mistress that he loved her. (He didn't show her)
    3. He told only his mistress that he loved her. (Kept it a secret from everyone else)
    4. He told his only mistress that he loved her. (Stresses that he had only ONE!)
    5. He told his mistress only that he loved her. (Didn't tell her anything else)
    6. He told his mistress that only he loved her. ("I'm all you got, nobody else wants you.")
    7. He told his mistress that he only loved her. (Not that he wanted to marry her.)
    8. He told his mistress that he loved only her. (Yeah, don't they all...).
    9. He told his mistress that he loved her only. (Similar to above one).
    • Capturing word orders and other relationships greatly increases the dimension • E.g., n-grams (bag of words over n-tuples)
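The effect can be checked on two of the nine sentences: their bags of words are identical, while their bigrams (n-grams with n = 2) differ. The helper name `ngrams` is hypothetical.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence_1 = "only he told his mistress that he loved her".split()
sentence_8 = "he told his mistress that he loved only her".split()

# Identical bags of words: the word order is lost.
print(Counter(sentence_1) == Counter(sentence_8))  # True

# Different bags of bigrams: part of the word order is kept.
print(Counter(ngrams(sentence_1, 2)) == Counter(ngrams(sentence_8, 2)))  # False
```

The price is dimensionality: with a vocabulary of V terms there are up to V² possible bigrams.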
  • 17. And the list goes on • Bad spelling • Slang • Synonyms not captured by lemmatization • Encodings and special characters • …
  • 18. Outline • Overview • Challenges for Text Mining • Summary
  • 19. Summary • Text mining is the analysis of textual data • Requires imposing a structure upon the unstructured texts • Bag of words • There are some standard techniques • Removing punctuation, cases, stop words • Stemming/Lemmatization • Often also removing numbers and special characters → Depends on context! • Many problems due to noise in the textual data