2. Agenda
● Text classification
● Sparse data
○ Dimensionality reduction / visualization of sparse data
○ Classification on sparse data
● Text embedding
○ Short explanation of doc2vec
○ Visualization sparse vs embedded
○ Classification sparse vs embedded
● Hands-on!
3. Text classification - Definition
● Text classification is the task of assigning predefined categories to free-text documents.
7. Example: Every word is a feature
Feature dimensions   Document 1   Document 2   Document 3   Document 4
                     (Class A)    (Class A)    (Class B)    (Class B)
1: arrived           0            1            4            5
2: received          1            2            3            5
3: gold              4            4            4            1
4: a                 1            0            1            2
5: energy            5            5            5            3
Each column is the feature vector of one document; the word dimensions together form the feature space.
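A minimal sketch of how such a feature vector could be computed, assuming simple whitespace tokenization and lowercasing. The helper name `count_vector` and the sample sentence are invented for illustration and are not from the talk; only the five vocabulary words come from the slide.

```python
from collections import Counter

# Fixed vocabulary: one feature dimension per word (words taken from the slide).
VOCABULARY = ["arrived", "received", "gold", "a", "energy"]

def count_vector(text, vocabulary=VOCABULARY):
    """Return how often each vocabulary word occurs in `text`."""
    # Assumption: tokens are separated by whitespace and compared in lowercase.
    counts = Counter(text.lower().split())
    # A Counter returns 0 for words that never occur in the text.
    return [counts[word] for word in vocabulary]

# Hypothetical document, made up for this example.
doc = "A shipment of gold arrived and the gold was received"
vector = count_vector(doc)
print(vector)
```

Every document becomes a list of the same length, one entry per vocabulary word, so documents of different lengths can be compared in the same feature space.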
9. Text = high dimensional
Feature dimensions   Document 1   Document 2   Document 3   Document 4
                     (Class A)    (Class A)    (Class B)    (Class B)
1: arrived           0            1            4            5
2: received          1            2            3            5
3: gold              4            4            4            1
4: a                 1            0            1            2
5: energy            5            5            5            3
10. Text = sparse
Feature dimensions   Document 1   Document 2   Document 3   Document 4   Document 5
                     (Class A)    (Class A)    (Class A)    (Class A)    (Class ?)
1: acquired          0            0            1            0            0
2: received          0            2            0            0            0
3: collected         1            0            0            0            0
4: a                 0            0            0            2            0
5: energy            0            0            0            0            1
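The sparsity of this table can be made concrete with a small calculation. The sketch below uses the counts from the slide; the helper names `sparsity` and `to_sparse` are hypothetical, and the dictionary layout is just one possible way to store only the nonzero entries.

```python
# Count matrix from the slide: rows are words, columns are documents 1 to 5.
counts = [
    [0, 0, 1, 0, 0],  # acquired
    [0, 2, 0, 0, 0],  # received
    [1, 0, 0, 0, 0],  # collected
    [0, 0, 0, 2, 0],  # a
    [0, 0, 0, 0, 1],  # energy
]

def sparsity(matrix):
    """Fraction of entries in the matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total

def to_sparse(matrix):
    """Keep only nonzero entries, keyed by (row, column)."""
    return {
        (r, c): value
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
        if value != 0
    }

print(sparsity(counts))
print(to_sparse(counts))
```

In this example each document uses a different word, so no two documents share a nonzero feature. That is what makes Document 5 hard to place when only raw word counts are available.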
11. Dataset: Reuters news article dataset
● Take the 100 top words across the whole corpus
● For each document, count how often each word occurs

            Document 1   Document 2
Word 1      0            2
Word 2      3            1
Word 3      4            4
...
Word 100    1            1

● Result: a feature space of 100 dimensions containing 21578 data points
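The two steps on this slide can be sketched as follows. This is an illustration only, assuming whitespace tokenization; the function names `top_words` and `count_matrix` and the tiny three-document corpus are made up here, and the example uses the top 2 words instead of the top 100 to keep the output readable.

```python
from collections import Counter

def top_words(documents, n):
    """Return the n most frequent words across the whole corpus."""
    totals = Counter()
    for text in documents:
        totals.update(text.lower().split())
    return [word for word, _ in totals.most_common(n)]

def count_matrix(documents, vocabulary):
    """For each document, count how often each vocabulary word occurs."""
    rows = []
    for text in documents:
        counts = Counter(text.lower().split())
        rows.append([counts[word] for word in vocabulary])
    return rows

# Hypothetical mini corpus standing in for the Reuters articles.
corpus = [
    "oil prices rose as oil demand grew",
    "oil prices fell",
    "oil exports",
]

vocabulary = top_words(corpus, 2)          # the talk uses n = 100
features = count_matrix(corpus, vocabulary)
print(vocabulary)
print(features)
```

Each row of `features` is one data point in a feature space with as many dimensions as there are vocabulary words. With n = 100 and the full dataset this gives the 100-dimensional space with 21578 points described on the slide.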
25. Embedding - pipeline
Documents = text + category
Text (10000+ dimensions) → doc2vec → document features (100 dimensions) → classification → categories
26. Reuters - score doc2vec vs top 100 words
Word count top 100 words: 0.90
Doc2vec: 0.94
27. IMDB movie reviews - doc2vec vs wordcount
Class: positive
Bromwell High is nothing short of brilliant. Expertly
scripted and perfectly delivered, this searing parody of
a students and teachers at a South London Public
School leaves you literally rolling with laughter. It's
vulgar, provocative, witty and sharp. The characters
are a superbly caricatured cross section of British
society (or to be more accurate, of any society).
Following the escapades of Keisha, Latrina and
Natella, our three "protagonists" for want of a better
term, the show doesn't shy away from parodying every
imaginable subject. Political correctness flies out the
window in every episode. If you enjoy shows that
aren't afraid to poke fun of every taboo subject
imaginable, then Bromwell High will not disappoint!
Class: negative
Robert DeNiro plays the most unbelievably intelligent
illiterate of all time. This movie is so wasteful of talent,
it is truly disgusting. The script is unbelievable. The
dialog is unbelievable. Jane Fonda's character is a
caricature of herself, and not a funny one. The movie
moves at a snail's pace, is photographed in an
ill-advised manner, and is insufferably preachy. It also
plugs in every cliche in the book. Swoozie Kurtz is
excellent in a supporting role, but so what?
Equally annoying is this new IMDB rule of requiring
ten lines for every review. When a movie is this
worthless, it doesn't require ten lines of text to let other
readers know that it is a waste of time and tape. Avoid
this movie.
29. IMDB movie reviews - doc2vec vs wordcount
Word count top 250 words: 0.72
Doc2vec: 0.83
30. Conclusion
● It’s all about extracting the right features from your data
● Visualize the data to get a sense of the value of your features
● You can use the same algorithms for text, image, audio and other kinds of data once it is converted to an abstract feature space