TF-IDF
Natalie Parde
UIC CS 421
What other ways can we build vector representations for words?

               c1    …    critique    …    cn
w1              …    …        …       …     …
…               …    …        …       …     …
critique        ?    ?        ?       ?     ?
…               …    …        …       …     …
wn              …    …        …       …     …
One Approach: TF-IDF
• Term Frequency × Inverse Document Frequency
• The meaning of a word is characterized by the counts of words in the same document, as well as across the collection overall
• To do this, a co-occurrence matrix is needed
TF-IDF originated as a tool for information retrieval.
• Rows: Words in a vocabulary
• Columns: Documents in a collection

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle           1                0               7            13
good           114               80              62            89
fool            36               58               1             4
wit             20               15               2             3

“wit” appears 3 times in Henry V
In a term-document matrix, rows can be viewed as word vectors.
• Each dimension corresponds to a document
• Words with similar vectors occur in similar documents

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle           1                0               7            13
good           114               80              62            89
fool            36               58               1             4
wit             20               15               2             3

Considering only the Julius Caesar and Henry V dimensions:

battle   [7, 13]
good     [62, 89]
fool     [1, 4]
wit      [2, 3]
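To make this concrete, here is a minimal sketch that builds a term-document count matrix and reads off a word's row as its vector. The corpus here is a toy placeholder, not the actual plays:

```python
# Build a term-document count matrix from a toy corpus.
# Each row (word) is a vector whose dimensions are the documents.
from collections import Counter

docs = {
    "As You Like It": "good fool wit wit good fool battle good".split(),
    "Twelfth Night":  "good fool wit good fool".split(),
    "Julius Caesar":  "battle good wit fool battle".split(),
    "Henry V":        "battle battle good wit battle".split(),
}

vocab = sorted({w for tokens in docs.values() for w in tokens})
counts = {name: Counter(tokens) for name, tokens in docs.items()}

# matrix[word] is that word's vector across the four documents.
matrix = {w: [counts[d][w] for d in docs] for w in vocab}
print(matrix["battle"])  # counts of "battle" in each document (toy values)
```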
Different Types of Context
• Documents aren’t the most common type of context used to represent meaning in word vectors
• More common: word context
• Referred to as a term-term matrix, word-word matrix, or term-context matrix
• In a word-word matrix, the columns are also labeled by words
• Thus, dimensionality is |V| × |V|
• Each cell records the number of times the row (target) word and the column (context) word co-occur in some context in a training corpus
How can you decide if two words occur in the same context?
• Common context windows:
  • Entire document
    • Cell value = # times the words co-occur in the same document
  • Predetermined span surrounding the target
    • Cell value = # times the words co-occur in this span of words
Example Context Window (Size = 4)
• Take each occurrence of a word (e.g., strawberry)
• Count the context words in the four-word spans before and after it to get a word-word co-occurrence matrix (a sketch of this counting appears below)

is traditionally followed by cherry pie, a traditional dessert
often mixed, such as strawberry rhubarb pie. Apple pie
computer peripherals and personal digital assistants. These devices usually
a computer. This includes information available on the internet
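A minimal sketch of this window-based counting, assuming an already-tokenized corpus (the sentence below is a toy stand-in):

```python
# Count co-occurrences within a +/-4-token window around each target word.
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=4):
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target itself
                counts[target][tokens[j]] += 1
    return counts

tokens = "often mixed such as strawberry rhubarb pie apple pie".split()
print(cooccurrence_counts(tokens)["strawberry"])
# Counter of the context words observed around "strawberry"
```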
Example Context Window (Size = 4)
• A simplified subset of a word-word co-occurrence matrix could appear as follows, given a sufficient corpus

              aardvark   …   computer   data   result   pie   sugar   …
cherry            0      …       2        8       9     442     25    …
strawberry        0      …       0        0       1      60     19    …
digital           0      …    1670     1683      85       5      4    …
information       0      …    3325     3982     378       5     13    …

The “strawberry” row is the vector for “strawberry”.
So far, our co-occurrence matrices have contained raw frequency counts of word co-occurrences.
• However, this isn’t the best measure of association between words
• Some words co-occur frequently with many words, so they won’t be very informative
  • the, it, they
• We want to know about words that co-occur frequently with one another, but less frequently across all texts
This is where TF-IDF comes in handy!
• Term Frequency: The frequency of the word t in the document d
  • $tf_{t,d} = \text{count}(t, d)$
• Document Frequency: The number of documents in which the word t occurs
  • Different from collection frequency (the number of times the word occurs in the entire collection of documents)
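As a minimal sketch on a toy two-document corpus, contrasting document frequency with collection frequency:

```python
# Term frequency, document frequency, and collection frequency on toy data.
from collections import Counter

docs = ["battle good wit".split(), "good fool wit good".split()]

def tf(t, d):
    return Counter(d)[t]                   # count of t within one document

def df(t, docs):
    return sum(1 for d in docs if t in d)  # number of documents containing t

print(tf("good", docs[1]))               # 2: term frequency in the second document
print(df("good", docs))                  # 2: document frequency
print(sum(tf("good", d) for d in docs))  # 3: collection frequency
```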
Computing TF-IDF
• Inverse Document Frequency: The inverse of document frequency, where N is the total number of documents in the collection
  • $idf_t = \frac{N}{df_t}$
• IDF is higher when the term occurs in fewer documents
• What is a document?
  • Individual instance in your corpus (e.g., book, play, sentence, etc.)
• It is often useful to perform these computations in log space
  • TF: $\log_{10}(tf_{t,d} + 1)$
  • IDF: $\log_{10}(idf_t)$
Computing TF-IDF
• TF-IDF is then simply the combination of TF and IDF
  • $tfidf_{t,d} = tf_{t,d} \times idf_t$
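Putting the pieces together, a minimal sketch that follows these formulas directly ($idf_t = N/df_t$, with the log-space variant as an option); the three documents are toy placeholders:

```python
# TF-IDF exactly as defined above: tfidf(t, d) = tf(t, d) * idf(t),
# with an optional log-space variant.
import math
from collections import Counter

docs = [
    "battle good wit".split(),
    "good fool wit good".split(),
    "good wit".split(),
]
N = len(docs)

def idf(t):
    df = sum(1 for d in docs if t in d)  # documents containing t
    return N / df

def tfidf(t, d, log_space=False):
    tf = Counter(d)[t]
    if log_space:
        return math.log10(tf + 1) * math.log10(idf(t))
    return tf * idf(t)

print(tfidf("battle", docs[0]))                  # 1 * (3/1) = 3.0
print(tfidf("battle", docs[0], log_space=True))  # log10(2) * log10(3) ≈ 0.144
```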
Example: Computing TF-IDF
• TF-IDF(battle, d1) = ?

          d1   d2   d3   d4
battle     1    0    7   13
good     114   80   62   89
fool      36   58    1    4
wit       20   15    2    3

• TF(battle, d1) = 1
• IDF(battle) = N/DF(battle) = 37/21 ≈ 1.76, using document frequencies from a 37-document corpus:

word     df
battle   21
good     37
fool     36
wit      34

• TF-IDF(battle, d1) = 1 × 1.76 = 1.76
• Alternately, in log space, TF-IDF(battle, d1) = $\log_{10}(1 + 1) \times \log_{10}(1.76) \approx 0.074$

Entering the log-space value into the matrix:

          d1      d2   d3   d4
battle    0.074    0    7   13
good    114       80   62   89
fool     36       58    1    4
wit      20       15    2    3
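A quick check of that arithmetic:

```python
# Verify the worked example: N = 37, df(battle) = 21, tf(battle, d1) = 1.
import math

idf = 37 / 21
print(round(idf, 2))                                  # 1.76
print(round(math.log10(1 + 1) * math.log10(idf), 3))  # 0.074
```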
To convert our entire word co-occurrence matrix to a TF-IDF matrix, we need to repeat this calculation for each element.

          d1      d2      d3      d4
battle    0.074   0.000   0.220   0.280
good      0.000   0.000   0.000   0.000
fool      0.019   0.021   0.004   0.008
wit       0.049   0.044   0.018   0.022
How does the TF-IDF matrix compare to the original term frequency matrix?

Term frequency:

          d1   d2   d3   d4
battle     1    0    7   13
good     114   80   62   89
fool      36   58    1    4
wit       20   15    2    3

TF-IDF:

          d1      d2      d3      d4
battle    0.074   0.000   0.220   0.280
good      0.000   0.000   0.000   0.000
fool      0.019   0.021   0.004   0.008
wit       0.049   0.044   0.018   0.022

• “good” occurs in every document, so it is not important in the overall scheme of things: its TF-IDF weights are all 0
• TF-IDF increases the importance of rarer words like “battle”
Note that the TF-IDF model produces a sparse vector.
• Sparse: Many (usually most) cells have values of 0

          d1      d2      d3      d4
battle    0.074   0.000   0.220   0.280
good      0.000   0.000   0.000   0.000
fool      0.019   0.021   0.004   0.008
wit       0.049   0.044   0.018   0.022

With a larger collection, the sparsity becomes even more apparent:

          d1    d2    d3    d4    d5    d6    d7
battle    0.1   0.0   0.0   0.0   0.2   0.0   0.3
good      0.0   0.0   0.0   0.0   0.0   0.0   0.0
fool      0.0   0.0   0.0   0.0   0.0   0.0   0.0
wit       0.0   0.0   0.0   0.0   0.0   0.0   0.0
This can be problematic!
• However, TF-IDF remains a useful starting point for vector space models
• Generally combined with standard machine learning algorithms, such as Logistic Regression and Naïve Bayes (see the sketch below)
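As one illustration of that pairing, a minimal sketch using scikit-learn (an assumption; the slides name no particular library). Note that scikit-learn's TfidfVectorizer uses a smoothed, natural-log IDF rather than the plain $N/df_t$ above, so its weights differ slightly from this lecture's formula:

```python
# Hypothetical sketch: sparse TF-IDF features feeding a logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["the battle was fierce", "a good fool has wit", "wit and good humor"]
labels = [0, 1, 1]  # toy class labels

vec = TfidfVectorizer()
X = vec.fit_transform(texts)               # sparse document-term TF-IDF matrix
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform(["good wit"])))  # -> [1] on this toy data
```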