The document discusses various techniques for representing text for classification tasks, including bag-of-words representations, term frequency-inverse document frequency (TF-IDF) weighting, and n-grams. It covers extracting unigram and bigram features to capture word and word-pair frequencies from documents. The document also discusses removing stopwords and using word histograms or one-hot encodings to transform text into numerical feature vectors.
5. Three Generations of Features or Representations
1. Intuition driven representations (hand-crafted)
2. Representations derived from Statistics, Signal Processing, etc.
3. Representations that are learned
6. Problem: Classify Textual Content
► Problem: Classify into one of C classes.
► E.g., Email: Spam vs. Non-spam, Professional vs. Personal;
Web page: Movies vs. Sports vs. Politics
7. What Do We Mean by Representing Text?
► Representations for words?
► Representations for phrases?
► Representations for sentences?
► Representations for documents?
8. Bag of Words - Text Domain - Motivation
► Word cloud: The size of the word indicates how often the word occurs in the
document
9. Bag of Words - Text Domain - Motivation
► What is the document talking about?
10. Bag of Words - Text Domain - Motivation
► Orderless document representation; frequencies of words from a dictionary
16. Problem Statement
► Input: an article (“Covid-19 education grading new system”)
► Task: Identify the countries from the article.
► Problem: How do we represent?
25. Bag of Words Histogram
► Orderless document representation; frequencies of words from a dictionary
► Classification to determine document category
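A bag-of-words histogram over a fixed dictionary can be sketched as follows (the dictionary, sentence, and tokenizer here are illustrative, not from the slides):

```python
from collections import Counter
import re

def bow_vector(doc, dictionary):
    """Orderless representation: count each dictionary word in the document."""
    counts = Counter(re.findall(r"[a-z]+", doc.lower()))
    return [counts[w] for w in dictionary]

dictionary = ["a", "also", "andrew", "boy", "good", "tall", "is", "ram", "ratna"]
print(bow_vector("Ram is a good boy. Ratna is also good.", dictionary))
# [1, 1, 0, 1, 2, 0, 2, 1, 1]
```

The output vector keeps word frequencies but, as the slide notes, discards word order entirely.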
40. Comments: Weight Words (TF-IDF)
► Not all words are equally useful.
) Some words do not add value.
) e.g., the, of, and, is, a, etc.
► Stop Words: removed from the dictionary
► Some words are more important
) Words that occur multiple times
) Words that are unique to a document
41. TF-IDF
► Frequent Words (TF: Term Frequency)
) The higher the frequency, the more relevant the term
) TF measures the frequency (count) of a term in a document
) TF is often divided by the document length
TF(T, D_i) = (# of occurrences of T in D_i) / (# of words in D_i)
42. TF-IDF
► Rare/Unique Words (IDF: Inverse Document Frequency)
) Need to weigh down frequent terms while scaling up the rare ones
IDF(T) = log_e( # of documents / # of docs with term T )
43. TF-IDF
D1: Andrew is a tall boy.
D2: Ram is a good boy. Ratna is also good.
Dictionary: a, also, Andrew, boy, good, tall, is, Ram, Ratna

TF(T, D_i) = (# of occurrences of T in D_i) / (# of words in D_i)

TF of “Andrew” in D1 = 1/5 = 0.2
TF of “good” in D2 = 2/9 ≈ 0.22
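The TF values above can be reproduced with a short sketch (tokenization here simply lowercases and strips punctuation):

```python
import re

def tf(term, doc):
    """Term frequency: count of the term divided by document length."""
    words = re.findall(r"[a-z]+", doc.lower())
    return words.count(term.lower()) / len(words)

d1 = "Andrew is a tall boy."
d2 = "Ram is a good boy. Ratna is also good."

print(tf("Andrew", d1))         # 1/5 = 0.2
print(round(tf("good", d2), 2)) # 2/9 ≈ 0.22
```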
49. TF-IDF
D1: Andrew is a tall boy.
D2: Ram is a good boy. Ratna is also good.
Dictionary: a, also, Andrew, boy, good, tall, is, Ram, Ratna

IDF(T) = log_e( # of documents / # of docs with term T )

IDF of “a” = log_e(2/2) = 0
IDF of “Andrew” = log_e(2/1) = 0.69
IDF of “is” = log_e(2/2) = 0
IDF of “good” = log_e(2/1) = 0.69
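The IDF values above can be checked with a small sketch (natural log, as on the slide; the tokenizer is an illustrative simplification):

```python
import math
import re

def idf(term, docs):
    """Inverse document frequency: log(N / # of docs containing the term)."""
    n_with_term = sum(
        term.lower() in re.findall(r"[a-z]+", d.lower()) for d in docs
    )
    return math.log(len(docs) / n_with_term)

docs = ["Andrew is a tall boy.", "Ram is a good boy. Ratna is also good."]
print(idf("a", docs))                 # log(2/2) = 0
print(round(idf("Andrew", docs), 2))  # log(2/1) ≈ 0.69
```

A term appearing in every document gets IDF 0, so common words contribute nothing to the TF-IDF weight.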
66. Stopwords
► Actual text: Word Count = 89
In reality, the laborer belongs to capital before he has sold himself to capital. His
economic bondage is both brought about and concealed by the periodic sale of
himself, by his change of masters, and by the oscillation in the market price of
labor power. Capitalist production, therefore, under its aspect of a continuous
connected process, of a process of reproduction, produces not only commodities,
not only surplus value, but it also produces and reproduces the capitalist relation;
on the one side the capitalist, on the other the wage-laborer.
► Stopwords: the, a, and, is, of, to, in, on, by, he, his, has, it, its, therefore, under, side, one, not, only, but, also, other
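Filtering with a stopword list like the one above can be sketched as follows (the tokenizer is a simplification, and the stopword set is taken from the slide):

```python
import re

STOPWORDS = {"the", "a", "and", "is", "of", "to", "in", "on", "by", "he",
             "his", "has", "it", "its", "therefore", "under", "side", "one",
             "not", "only", "but", "also", "other"}

def remove_stopwords(text):
    """Keep only tokens that are not in the stopword set."""
    tokens = re.findall(r"[a-z\-]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

text = "Capitalist production, therefore, under its aspect of a continuous connected process"
print(remove_stopwords(text))
# ['capitalist', 'production', 'aspect', 'continuous', 'connected', 'process']
```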
70. N-Grams: Capturing Context
► An n-gram is a sequence of n consecutive items (words, characters)
) Unigram, Bigram, Trigram, 4-gram, 5-gram, etc.
► The histogram (BoW) representation counts unigrams.
► What about bigrams or trigrams?
71. Bigrams
Friends, Romans, countrymen, lend me your ears
I come to bury Caesar, not to praise him.
The evil that men do lives after them;
The good is oft interred with their bones;
So let it be with Caesar.
The noble Brutus hath told you Caesar was ambitious:
If it were so, it was a grievous fault,
And grievously hath Caesar answered it.
Here, under leave of Brutus and the rest
For Brutus is an honorable man;
So are they all, all honorable men
Come I to speak in Caesar’s funeral.
He was my friend, faithful and just to me:
But Brutus says he was ambitious;
And Brutus is an honorable man.
He hath brought many captives home to Rome
Whose ransoms did the general coffers fill:
Did this in Caesar seem ambitious?
When that the poor have cried, Caesar hath wept:
Ambition should be made of sterner stuff:
......................................................
Marc Antony at Caesar’s funeral (William Shakespeare, Julius Caesar)
Most frequent bigrams:
► (’of’, ’caesar’): 6
► (’you’, ’know’): 6
► (’you’, ’all’): 5
► (’was’, ’ambitious’): 4
► (’brutus’, ’is’): 4
► (’is’, ’an’): 4
► (’an’, ’honorable’): 4
► (’honorable’, ’man’): 4
► (’honorable’, ’men’): 4
► (’he’, ’was’): 4
► (’in’, ’his’): 4
► (’know’, ’not’): 4
► (’to’, ’speak’): 3
► (’brutus’, ’says’): 3
► (’ambitious’, ’and’): 3
► (’and’, ’brutus’): 3
► (’he’, ’hath’): 3
► (’tell’, ’you’): 3
► (’let’, ’it’): 2
► (’with’, ’caesar’): 2
► (’the’, ’noble’): 2
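Bigram counting of this kind can be sketched as follows, assuming simple lowercase word tokenization (run here on just the opening lines, so the counts do not match the full-speech figures above):

```python
from collections import Counter
import re

def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

text = ("Friends, Romans, countrymen, lend me your ears; "
        "I come to bury Caesar, not to praise him.")
tokens = re.findall(r"[a-z]+", text.lower())

bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))
```

The same `ngrams` helper yields trigrams with `n=3`, and unigrams (the BoW case) with `n=1`.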
75. Use of N-grams
► Representing sentences, documents
► Modeling language
) Predicting next word (Auto-complete)
) Resolving ambiguity (Speech recognition, OCR)
) Machine Translation (choosing one sentence over another)
) Many other tasks: Question answering, classification, etc.
► Smoothing N-grams
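The auto-complete use above can be illustrated with a toy bigram model that predicts the most frequent follower of a word (the corpus here is a made-up snippet, not data from the slides):

```python
from collections import Counter, defaultdict
import re

corpus = ("brutus is an honorable man . "
          "brutus is an honorable man . "
          "brutus says he was ambitious .")
tokens = re.findall(r"[a-z.]+", corpus.lower())

# For each word, count which words follow it in the corpus.
followers = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    followers[w1][w2] += 1

def predict_next(word):
    """Return the most frequent word observed after `word` (seen words only)."""
    return followers[word].most_common(1)[0][0]

print(predict_next("brutus"))  # 'is' (seen twice, vs. 'says' once)
```

A real bigram language model would normalize these counts into probabilities and smooth them, as the "Smoothing N-grams" bullet above suggests.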
76. Summary
► Standard Techniques:
) Remove stop words (non-informative words)
) Weigh words differently (TF-IDF)
► Histogram (Bag of Words)
) Unordered, simple representation
) Though the structure of sentences is lost in the histogram, we can still classify based on word occurrences.
77. Summary
► Deeper Techniques from NLP:
) N-gram Representations
) Identify named entities (proper nouns), word morphology.
► Learning Representations