SlideShare a Scribd company logo
Bag-of-Words Methods
for
Text Mining
CSCI-GA.2590 – Lecture 2A
Ralph Grishman
NYU
Bag of Words Models
• do we really need elaborate linguistic analysis?
• look at text mining applications
– document retrieval
– opinion mining
– association mining
• see how far we can get with document-level bag-
of-words models
– and introduce some of our mathematical approaches
1/16/14 NYU 2
• document retrieval
• opinion mining
• association mining
1/16/14 NYU 3
Information Retrieval
• Task: given query = list of keywords, identify
and rank relevant documents from collection
• Basic idea:
find documents whose set of words most
closely matches words in query
1/16/14 NYU 4
Topic Vector
• Suppose the document collection has n distinct words,
w1, …, wn
• Each document is characterized by an n-dimensional
vector whose ith component is the frequency of word
wi in the document
Example
• D1 = [The cat chased the mouse.]
• D2 = [The dog chased the cat.]
• W = [The, chased, dog, cat, mouse] (n = 5)
• V1 = [ 2 , 1 , 0 , 1 , 1 ]
• V2 = [ 2 , 1 , 1 , 1 , 0 ]
1/16/14 NYU 5
Weighting the components
• Unusual words like elephant determine the topic much more
than common words such as “the” or “have”
• can ignore words on a stop list or
• weight each term frequency tfi by its inverse document
frequency idfi
• where N = size of collection and ni = number of documents
containing term i
1/16/14 NYU 6

witfiidfi

idfilog(N
ni
)
Cosine similarity metric
Define a similarity metric between topic vectors
A common choice is cosine similarity (dot
product):
1/16/14 NYU 7

sim(A,B)
aibi
i

ai
2
i
  bi
2
i

Cosine similarity metric
• the cosine similarity metric is the cosine of the
angle between the term vectors:
1/16/14 NYU 8
w1
w2
Verdict: a success
• For heterogenous text collections, the vector
space model, tf-idf weighting, and cosine
similarity have been the basis for successful
document retrieval for over 50 years
• document retrieval
• opinion mining
• association mining
1/16/14 NYU 10
Opinion Mining
– Task: judge whether a document expresses a positive
or negative opinion (or no opinion) about an object or
topic
• classification task
• valuable for producers/marketers of all sorts of products
– Simple strategy: bag-of-words
• make lists of positive and negative words
• see which predominate in a given document
(and mark as ‘no opinion’ if there are few words of either
type
• problem: hard to make such lists
– lists will differ for different topics
1/16/14 11
NYU
Training a Model
– Instead, label the documents in a corpus and then
train a classifier from the corpus
• labeling is easier than thinking up words
• in effect, learn positive and negative words from the
corpus
– probabilistic classifier
• identify most likely class
s = argmax P ( t | W)
t є {pos, neg}
1/16/14 12
NYU
Using Bayes’ Rule
• The last step is based on the naïve assumption
of independence of the word probabilities
1/16/14 NYU 13

argmaxt P(t|W)
argmaxt P(W |t)P(t)
P(W)
argmaxt P(W |t)P(t)
argmaxt P(w1,...,wn |t)P(t)
argmaxt P(wi |t)P(t)
i

Training
• We now estimate these probabilities from the
training corpus (of N documents) using maximum
likelihood estimators
• P ( t ) = count ( docs labeled t) / N
• P ( wi | t ) =
count ( docs labeled t containing wi )
count ( docs labeled t)
1/16/14 NYU 14
A Problem
• Suppose a glowing review GR (with lots of
positive words) includes one word,
“mathematical”, previously seen only in
negative reviews
• P ( positive | GR ) = ?
1/16/14 NYU 15
P ( positive | GR ) = 0
because P ( “mathematical” | positive ) = 0
The maximum likelihood estimate is poor when
there is very little data
We need to ‘smooth’ the probabilities to avoid
this problem
1/16/14 NYU 16
Laplace Smoothing
• A simple remedy is to add 1 to each count
– to keep them as probabilities
we increase the denominator N by the number of outcomes
(values of t) (2 for ‘positive’ and ‘negative’)
– for the conditional probabilities P( w | t ) there are similarly two
outcomes (w is present or absent)
1/16/14 NYU 17
 
t
t
P 1
)
(
An NLTK Demo
Sentiment Analysis with Python NLTK Text
Classification
• http://text-processing.com/demo/sentiment/
NLTK Code (simplified classifier)
• http://streamhacker.com/2010/05/10/text-
classification-sentiment-analysis-naive-bayes-classifier
1/16/14 NYU 18
• Is “low” a positive or a negative term?
1/16/14 NYU 19
Ambiguous terms
“low” can be positive
“low price”
or negative
“low quality”
1/16/14 NYU 20
Verdict: mixed
• A simple bag-of-words strategy with a NB
model works quite well for simple reviews
referring to a single item, but fails
– for ambiguous terms
– for comparative reviews
– to reveal aspects of an opinion
• the car looked great and handled well,
but the wheels kept falling off
1/16/14 NYU 21
• document retrieval
• opinion mining
• association mining
1/16/14 NYU 22
Association Mining
• Goal: find interesting relationships among attributes
of an object in a large collection …
objects with attribute A also have attribute B
– e.g., “people who bought A also bought B”
• For text: documents with term A also have term B
– widely used in scientific and medical literature
1/16/14 NYU 23
Bag-of-words
• Simplest approach
• look for words A and B for which
frequency (A and B in same document) >>
frequency of A x frequency of B
• doesn’t work well
• want to find names (of companies, products, genes),
not individual words
• interested in specific types of terms
• want to learn from a few examples
– need contexts to avoid noise
1/16/14 NYU 24
Needed Ingredients
• Effective text association mining needs
– name recognition
– term classification
– preferably: ability to learn patterns
– preferably: a good GUI
– demo: www.coremine.com
1/16/14 NYU 25
Conclusion
• Some tasks can be handled effectively (and
very simply) by bag-of-words models,
• but most benefit from an analysis of language
structure
1/16/14 NYU 26

More Related Content

Similar to 4578663.ppt

08. EDT 513 2023 Week 8.pptx
08. EDT 513 2023 Week 8.pptx08. EDT 513 2023 Week 8.pptx
08. EDT 513 2023 Week 8.pptx
Gambari Amosa Isiaka
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
rchbeir
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
Patrice Bellot - Aix-Marseille Université / CNRS (LIS, INS2I)
 
M.saleem literature review
M.saleem  literature reviewM.saleem  literature review
M.saleem literature review
E&S Education Department, KP
 
Lesson 4 secondary research 2
Lesson 4   secondary research 2Lesson 4   secondary research 2
Lesson 4 secondary research 2
Kavita Parwani
 
information-skills-for-researchers-v3
information-skills-for-researchers-v3information-skills-for-researchers-v3
information-skills-for-researchers-v3
Jacqueline Thomas
 
Reading and writing at m level for scitt
Reading and writing at m level for scittReading and writing at m level for scitt
Reading and writing at m level for scitt
Philwood
 
Literature Review(For Young Researchers)
Literature Review(For Young Researchers)Literature Review(For Young Researchers)
Literature Review(For Young Researchers)
DrAmitPurushottam
 
STEM Mom Speaks to Teachers at Princeton University
STEM Mom Speaks to Teachers at Princeton University STEM Mom Speaks to Teachers at Princeton University
STEM Mom Speaks to Teachers at Princeton University
Darci the STEM Mom
 
Tools that Encourage Criticism - Leiden University Symposium on Tools Criticism
Tools that Encourage Criticism - Leiden University Symposium on Tools CriticismTools that Encourage Criticism - Leiden University Symposium on Tools Criticism
Tools that Encourage Criticism - Leiden University Symposium on Tools Criticism
Marijn Koolen
 
Chapter 2 ARCHITECTURAL RESEARCH METHODOLOGY
Chapter 2 ARCHITECTURAL RESEARCH METHODOLOGYChapter 2 ARCHITECTURAL RESEARCH METHODOLOGY
Chapter 2 ARCHITECTURAL RESEARCH METHODOLOGY
Hazrina Haja
 
DOK J. (A.R. PRESENTATION) part rwo.pptx
DOK J. (A.R. PRESENTATION) part rwo.pptxDOK J. (A.R. PRESENTATION) part rwo.pptx
DOK J. (A.R. PRESENTATION) part rwo.pptx
AljohnEspejo1
 
Literature Review PPT.pptx
Literature Review PPT.pptxLiterature Review PPT.pptx
Literature Review PPT.pptx
MuhayyoMadirimova
 
Xiaowei Zhou
Xiaowei ZhouXiaowei Zhou
Getting a Head Start with Your Thesis Proposal
Getting a Head Start with Your Thesis ProposalGetting a Head Start with Your Thesis Proposal
Getting a Head Start with Your Thesis Proposal
wwuwritingcenter
 
Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval
GESIS
 
Introduction and Tools of Research
Introduction and Tools of ResearchIntroduction and Tools of Research
Introduction and Tools of Research
Myke Evans
 
Lesson 3 - Secondary Research 1
Lesson 3 - Secondary Research 1Lesson 3 - Secondary Research 1
Lesson 3 - Secondary Research 1
Kavita Parwani
 
Getting started; Literature review (Based on Cresswell) Research Methodology
Getting started; Literature review (Based on Cresswell) Research MethodologyGetting started; Literature review (Based on Cresswell) Research Methodology
Getting started; Literature review (Based on Cresswell) Research Methodology
Rajkumar Tyata
 
Lecture-2.pdf
Lecture-2.pdfLecture-2.pdf
Lecture-2.pdf
MohammadIslam814740
 

Similar to 4578663.ppt (20)

08. EDT 513 2023 Week 8.pptx
08. EDT 513 2023 Week 8.pptx08. EDT 513 2023 Week 8.pptx
08. EDT 513 2023 Week 8.pptx
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
M.saleem literature review
M.saleem  literature reviewM.saleem  literature review
M.saleem literature review
 
Lesson 4 secondary research 2
Lesson 4   secondary research 2Lesson 4   secondary research 2
Lesson 4 secondary research 2
 
information-skills-for-researchers-v3
information-skills-for-researchers-v3information-skills-for-researchers-v3
information-skills-for-researchers-v3
 
Reading and writing at m level for scitt
Reading and writing at m level for scittReading and writing at m level for scitt
Reading and writing at m level for scitt
 
Literature Review(For Young Researchers)
Literature Review(For Young Researchers)Literature Review(For Young Researchers)
Literature Review(For Young Researchers)
 
STEM Mom Speaks to Teachers at Princeton University
STEM Mom Speaks to Teachers at Princeton University STEM Mom Speaks to Teachers at Princeton University
STEM Mom Speaks to Teachers at Princeton University
 
Tools that Encourage Criticism - Leiden University Symposium on Tools Criticism
Tools that Encourage Criticism - Leiden University Symposium on Tools CriticismTools that Encourage Criticism - Leiden University Symposium on Tools Criticism
Tools that Encourage Criticism - Leiden University Symposium on Tools Criticism
 
Chapter 2 ARCHITECTURAL RESEARCH METHODOLOGY
Chapter 2 ARCHITECTURAL RESEARCH METHODOLOGYChapter 2 ARCHITECTURAL RESEARCH METHODOLOGY
Chapter 2 ARCHITECTURAL RESEARCH METHODOLOGY
 
DOK J. (A.R. PRESENTATION) part rwo.pptx
DOK J. (A.R. PRESENTATION) part rwo.pptxDOK J. (A.R. PRESENTATION) part rwo.pptx
DOK J. (A.R. PRESENTATION) part rwo.pptx
 
Literature Review PPT.pptx
Literature Review PPT.pptxLiterature Review PPT.pptx
Literature Review PPT.pptx
 
Xiaowei Zhou
Xiaowei ZhouXiaowei Zhou
Xiaowei Zhou
 
Getting a Head Start with Your Thesis Proposal
Getting a Head Start with Your Thesis ProposalGetting a Head Start with Your Thesis Proposal
Getting a Head Start with Your Thesis Proposal
 
Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval Intra- and interdisciplinary cross-concordances for information retrieval
Intra- and interdisciplinary cross-concordances for information retrieval
 
Introduction and Tools of Research
Introduction and Tools of ResearchIntroduction and Tools of Research
Introduction and Tools of Research
 
Lesson 3 - Secondary Research 1
Lesson 3 - Secondary Research 1Lesson 3 - Secondary Research 1
Lesson 3 - Secondary Research 1
 
Getting started; Literature review (Based on Cresswell) Research Methodology
Getting started; Literature review (Based on Cresswell) Research MethodologyGetting started; Literature review (Based on Cresswell) Research Methodology
Getting started; Literature review (Based on Cresswell) Research Methodology
 
Lecture-2.pdf
Lecture-2.pdfLecture-2.pdf
Lecture-2.pdf
 

More from nyomans1

PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptPPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
nyomans1
 
Template Pertemuan 1 All MK - Copy.pptx
Template Pertemuan 1 All MK - Copy.pptxTemplate Pertemuan 1 All MK - Copy.pptx
Template Pertemuan 1 All MK - Copy.pptx
nyomans1
 
Clustering_hirarki (tanpa narasi) (1).pptx
Clustering_hirarki (tanpa narasi) (1).pptxClustering_hirarki (tanpa narasi) (1).pptx
Clustering_hirarki (tanpa narasi) (1).pptx
nyomans1
 
slide 7_olap_example.ppt
slide 7_olap_example.pptslide 7_olap_example.ppt
slide 7_olap_example.ppt
nyomans1
 
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptPPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
nyomans1
 
Security Requirement.pptx
Security Requirement.pptxSecurity Requirement.pptx
Security Requirement.pptx
nyomans1
 
Minggu_1_Matriks_dan_Operasinya.pptx
Minggu_1_Matriks_dan_Operasinya.pptxMinggu_1_Matriks_dan_Operasinya.pptx
Minggu_1_Matriks_dan_Operasinya.pptx
nyomans1
 
Matriks suplemen.ppt
Matriks suplemen.pptMatriks suplemen.ppt
Matriks suplemen.ppt
nyomans1
 
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
nyomans1
 
10-Image-Enhancement-Bagian3-2021.pptx
10-Image-Enhancement-Bagian3-2021.pptx10-Image-Enhancement-Bagian3-2021.pptx
10-Image-Enhancement-Bagian3-2021.pptx
nyomans1
 
08-Image-Enhancement-Bagian1.pptx
08-Image-Enhancement-Bagian1.pptx08-Image-Enhancement-Bagian1.pptx
08-Image-Enhancement-Bagian1.pptx
nyomans1
 
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
nyomans1
 
04-Format-citra-dan-struktur-data-citra-2021.pptx
04-Format-citra-dan-struktur-data-citra-2021.pptx04-Format-citra-dan-struktur-data-citra-2021.pptx
04-Format-citra-dan-struktur-data-citra-2021.pptx
nyomans1
 
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
nyomans1
 
03spatialfiltering-130424050639-phpapp02.pptx
03spatialfiltering-130424050639-phpapp02.pptx03spatialfiltering-130424050639-phpapp02.pptx
03spatialfiltering-130424050639-phpapp02.pptx
nyomans1
 
Q-Step_WS_02102019_Practical_introduction_to_Python.pptx
Q-Step_WS_02102019_Practical_introduction_to_Python.pptxQ-Step_WS_02102019_Practical_introduction_to_Python.pptx
Q-Step_WS_02102019_Practical_introduction_to_Python.pptx
nyomans1
 
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptxBAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
nyomans1
 
Support-Vector-Machines_EJ_v5.06.pptx
Support-Vector-Machines_EJ_v5.06.pptxSupport-Vector-Machines_EJ_v5.06.pptx
Support-Vector-Machines_EJ_v5.06.pptx
nyomans1
 
06-Image-Histogram-2021.pptx
06-Image-Histogram-2021.pptx06-Image-Histogram-2021.pptx
06-Image-Histogram-2021.pptx
nyomans1
 
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
nyomans1
 

More from nyomans1 (20)

PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptPPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
 
Template Pertemuan 1 All MK - Copy.pptx
Template Pertemuan 1 All MK - Copy.pptxTemplate Pertemuan 1 All MK - Copy.pptx
Template Pertemuan 1 All MK - Copy.pptx
 
Clustering_hirarki (tanpa narasi) (1).pptx
Clustering_hirarki (tanpa narasi) (1).pptxClustering_hirarki (tanpa narasi) (1).pptx
Clustering_hirarki (tanpa narasi) (1).pptx
 
slide 7_olap_example.ppt
slide 7_olap_example.pptslide 7_olap_example.ppt
slide 7_olap_example.ppt
 
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.pptPPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
PPT-UEU-Keamanan-Informasi-Pertemuan-5.ppt
 
Security Requirement.pptx
Security Requirement.pptxSecurity Requirement.pptx
Security Requirement.pptx
 
Minggu_1_Matriks_dan_Operasinya.pptx
Minggu_1_Matriks_dan_Operasinya.pptxMinggu_1_Matriks_dan_Operasinya.pptx
Minggu_1_Matriks_dan_Operasinya.pptx
 
Matriks suplemen.ppt
Matriks suplemen.pptMatriks suplemen.ppt
Matriks suplemen.ppt
 
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
fdokumen.com_muh1g3-matriks-dan-ruang-vektor-3-312017-muh1g3-matriks-dan-ruan...
 
10-Image-Enhancement-Bagian3-2021.pptx
10-Image-Enhancement-Bagian3-2021.pptx10-Image-Enhancement-Bagian3-2021.pptx
10-Image-Enhancement-Bagian3-2021.pptx
 
08-Image-Enhancement-Bagian1.pptx
08-Image-Enhancement-Bagian1.pptx08-Image-Enhancement-Bagian1.pptx
08-Image-Enhancement-Bagian1.pptx
 
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
03-Pembentukan-Citra-dan-Digitalisasi-Citra.pptx
 
04-Format-citra-dan-struktur-data-citra-2021.pptx
04-Format-citra-dan-struktur-data-citra-2021.pptx04-Format-citra-dan-struktur-data-citra-2021.pptx
04-Format-citra-dan-struktur-data-citra-2021.pptx
 
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
02-Pengantar-Pengolahan-Citra-Bag2-2021.pptx
 
03spatialfiltering-130424050639-phpapp02.pptx
03spatialfiltering-130424050639-phpapp02.pptx03spatialfiltering-130424050639-phpapp02.pptx
03spatialfiltering-130424050639-phpapp02.pptx
 
Q-Step_WS_02102019_Practical_introduction_to_Python.pptx
Q-Step_WS_02102019_Practical_introduction_to_Python.pptxQ-Step_WS_02102019_Practical_introduction_to_Python.pptx
Q-Step_WS_02102019_Practical_introduction_to_Python.pptx
 
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptxBAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
BAB 2_TIPE DATA, VARIABEL, DAN OPERATOR (1) (1).pptx
 
Support-Vector-Machines_EJ_v5.06.pptx
Support-Vector-Machines_EJ_v5.06.pptxSupport-Vector-Machines_EJ_v5.06.pptx
Support-Vector-Machines_EJ_v5.06.pptx
 
06-Image-Histogram-2021.pptx
06-Image-Histogram-2021.pptx06-Image-Histogram-2021.pptx
06-Image-Histogram-2021.pptx
 
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
05-Operasi-dasar-pengolahan-citra-2021 (1).pptx
 

Recently uploaded

Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
Vineet
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
osoyvvf
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
asyed10
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
oaxefes
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
22ad0301
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
Timothy Spann
 
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
exukyp
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
Timothy Spann
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
NABLAS株式会社
 
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service LucknowCall Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
hiju9823
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
Bisnar Chase Personal Injury Attorneys
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
Vineet
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
nyvan3
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
Vineet
 

Recently uploaded (20)

Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
 
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
 
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service LucknowCall Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
 

4578663.ppt

  • 1. Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A Ralph Grishman NYU
  • 2. Bag of Words Models • do we really need elaborate linguistic analysis? • look at text mining applications – document retrieval – opinion mining – association mining • see how far we can get with document-level bag- of-words models – and introduce some of our mathematical approaches 1/16/14 NYU 2
  • 3. • document retrieval • opinion mining • association mining 1/16/14 NYU 3
  • 4. Information Retrieval • Task: given query = list of keywords, identify and rank relevant documents from collection • Basic idea: find documents whose set of words most closely matches words in query 1/16/14 NYU 4
  • 5. Topic Vector • Suppose the document collection has n distinct words, w1, …, wn • Each document is characterized by an n-dimensional vector whose ith component is the frequency of word wi in the document Example • D1 = [The cat chased the mouse.] • D2 = [The dog chased the cat.] • W = [The, chased, dog, cat, mouse] (n = 5) • V1 = [ 2 , 1 , 0 , 1 , 1 ] • V2 = [ 2 , 1 , 1 , 1 , 0 ] 1/16/14 NYU 5
  • 6. Weighting the components • Unusual words like elephant determine the topic much more than common words such as “the” or “have” • can ignore words on a stop list or • weight each term frequency tfi by its inverse document frequency idfi • where N = size of collection and ni = number of documents containing term i 1/16/14 NYU 6  witfiidfi  idfilog(N ni )
  • 7. Cosine similarity metric Define a similarity metric between topic vectors A common choice is cosine similarity (dot product): 1/16/14 NYU 7  sim(A,B) aibi i  ai 2 i   bi 2 i 
  • 8. Cosine similarity metric • the cosine similarity metric is the cosine of the angle between the term vectors: 1/16/14 NYU 8 w1 w2
  • 9. Verdict: a success • For heterogenous text collections, the vector space model, tf-idf weighting, and cosine similarity have been the basis for successful document retrieval for over 50 years
  • 10. • document retrieval • opinion mining • association mining 1/16/14 NYU 10
  • 11. Opinion Mining – Task: judge whether a document expresses a positive or negative opinion (or no opinion) about an object or topic • classification task • valuable for producers/marketers of all sorts of products – Simple strategy: bag-of-words • make lists of positive and negative words • see which predominate in a given document (and mark as ‘no opinion’ if there are few words of either type • problem: hard to make such lists – lists will differ for different topics 1/16/14 11 NYU
  • 12. Training a Model – Instead, label the documents in a corpus and then train a classifier from the corpus • labeling is easier than thinking up words • in effect, learn positive and negative words from the corpus – probabilistic classifier • identify most likely class s = argmax P ( t | W) t є {pos, neg} 1/16/14 12 NYU
  • 13. Using Bayes’ Rule • The last step is based on the naïve assumption of independence of the word probabilities 1/16/14 NYU 13  argmaxt P(t|W) argmaxt P(W |t)P(t) P(W) argmaxt P(W |t)P(t) argmaxt P(w1,...,wn |t)P(t) argmaxt P(wi |t)P(t) i 
  • 14. Training • We now estimate these probabilities from the training corpus (of N documents) using maximum likelihood estimators • P ( t ) = count ( docs labeled t) / N • P ( wi | t ) = count ( docs labeled t containing wi ) count ( docs labeled t) 1/16/14 NYU 14
  • 15. A Problem • Suppose a glowing review GR (with lots of positive words) includes one word, “mathematical”, previously seen only in negative reviews • P ( positive | GR ) = ? 1/16/14 NYU 15
  • 16. P ( positive | GR ) = 0 because P ( “mathematical” | positive ) = 0 The maximum likelihood estimate is poor when there is very little data We need to ‘smooth’ the probabilities to avoid this problem 1/16/14 NYU 16
  • 17. Laplace Smoothing • A simple remedy is to add 1 to each count – to keep them as probabilities we increase the denominator N by the number of outcomes (values of t) (2 for ‘positive’ and ‘negative’) – for the conditional probabilities P( w | t ) there are similarly two outcomes (w is present or absent) 1/16/14 NYU 17   t t P 1 ) (
  • 18. An NLTK Demo Sentiment Analysis with Python NLTK Text Classification • http://text-processing.com/demo/sentiment/ NLTK Code (simplified classifier) • http://streamhacker.com/2010/05/10/text- classification-sentiment-analysis-naive-bayes-classifier 1/16/14 NYU 18
  • 19. • Is “low” a positive or a negative term? 1/16/14 NYU 19
  • 20. Ambiguous terms “low” can be positive “low price” or negative “low quality” 1/16/14 NYU 20
  • 21. Verdict: mixed • A simple bag-of-words strategy with a NB model works quite well for simple reviews referring to a single item, but fails – for ambiguous terms – for comparative reviews – to reveal aspects of an opinion • the car looked great and handled well, but the wheels kept falling off 1/16/14 NYU 21
  • 22. • document retrieval • opinion mining • association mining 1/16/14 NYU 22
  • 23. Association Mining • Goal: find interesting relationships among attributes of an object in a large collection … objects with attribute A also have attribute B – e.g., “people who bought A also bought B” • For text: documents with term A also have term B – widely used in scientific and medical literature 1/16/14 NYU 23
  • 24. Bag-of-words • Simplest approach • look for words A and B for which frequency (A and B in same document) >> frequency of A x frequency of B • doesn’t work well • want to find names (of companies, products, genes), not individual words • interested in specific types of terms • want to learn from a few examples – need contexts to avoid noise 1/16/14 NYU 24
  • 25. Needed Ingredients • Effective text association mining needs – name recognition – term classification – preferably: ability to learn patterns – preferably: a good GUI – demo: www.coremine.com 1/16/14 NYU 25
  • 26. Conclusion • Some tasks can be handled effectively (and very simply) by bag-of-words models, • but most benefit from an analysis of language structure 1/16/14 NYU 26