Text classification presentation

Text classification
Knowledge session

Agenda
● Text classification
● Sparse data
○ Dimensionality reduction / visualization sparse data
○ Classification on sparse data
● Text embedding
○ Short explanation doc2vec
○ Visualization sparse vs embedded
○ Classification sparse vs embedded
● Hands-on!

Text classification - Definition
● Text classification is the task of assigning predefined categories to free-text documents.

Example: News article classification
What is the category of this news article?

Example: News article classification
Examples: Great war, Sunken ships

Example: Every word is a feature
Each row is a feature dimension (one word); each column is the feature vector of one document in the feature space.

Feature        Document 1   Document 2   Document 3   Document 4
               (Class A)    (Class A)    (Class B)    (Class B)
1: arrived     0            1            4            5
2: received    1            2            3            5
3: gold        4            4            4            1
4: a           1            0            1            2
5: energy      5            5            5            3
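
A minimal sketch of how a document becomes such a feature vector, assuming a whitespace tokenizer and a fixed vocabulary. The example sentence and the helper names `tokenize` and `count_vector` are hypothetical and only serve as illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def count_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# The vocabulary mirrors the five words on the slide; the document is made up.
vocabulary = ["arrived", "received", "gold", "a", "energy"]
document = "A shipment of gold arrived and a second shipment of gold arrived later"

vector = count_vector(document, vocabulary)
print(vector)  # → [2, 0, 2, 2, 0]
```

Words outside the vocabulary ("shipment", "later") are simply ignored, which is why the choice of vocabulary determines the feature space.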

Dimensionality
● Features (one word per feature)
● Classes

Text = high dimensional

Feature        Document 1   Document 2   Document 3   Document 4
               (Class A)    (Class A)    (Class B)    (Class B)
1: arrived     0            1            4            5
2: received    1            2            3            5
3: gold        4            4            4            1
4: a           1            0            1            2
5: energy      5            5            5            3

Text = sparse

Feature        Document 1   Document 2   Document 3   Document 4   Document 5
               (Class A)    (Class A)    (Class A)    (Class A)    (Class ?)
1: acquired    0            0            1            0            0
2: received    0            2            0            0            0
3: collected   1            0            0            0            0
4: a           0            0            0            2            0
5: energy      0            0            0            0            1
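
To make "sparse" concrete, the sketch below measures the fraction of zero entries in the counts from this slide and stores each row by its non-zero entries only. The helper names `sparsity` and `to_sparse` are hypothetical; real projects would more likely use a sparse matrix type from a numerical library.

```python
def sparsity(matrix):
    """Fraction of entries in a list-of-lists matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


def to_sparse(row):
    """Keep only the non-zero entries of a row as {column index: value}."""
    return {index: value for index, value in enumerate(row) if value != 0}


# Rows are features, columns are documents 1-5, copied from the slide.
counts = [
    [0, 0, 1, 0, 0],  # acquired
    [0, 2, 0, 0, 0],  # received
    [1, 0, 0, 0, 0],  # collected
    [0, 0, 0, 2, 0],  # a
    [0, 0, 0, 0, 1],  # energy
]

print(sparsity(counts))      # → 0.8
print(to_sparse(counts[1]))  # → {1: 2}
```

No two documents share a word here, so a classifier that only compares word counts has nothing to link document 5 to the others.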

Dataset: Reuters news article dataset
● Take the 100 top words across the whole corpus
● For each document, count how often each word occurs

           Document 1   Document 2
Word 1     0            2
Word 2     3            1
Word 3     4            4
...
Word 100   1            1

● Result: a feature space of 100 dimensions containing 21578 data points
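
The selection of the top words could look roughly like the sketch below. It uses a made-up three-document corpus and `n = 3` instead of the Reuters data and 100 words; `top_words` and `count_matrix` are hypothetical helper names.

```python
from collections import Counter


def top_words(documents, n):
    """Return the n most frequent words over all documents."""
    totals = Counter()
    for text in documents:
        totals.update(text.lower().split())
    return [word for word, _ in totals.most_common(n)]


def count_matrix(documents, vocabulary):
    """Return one row per document with the count of each vocabulary word."""
    rows = []
    for text in documents:
        counts = Counter(text.lower().split())
        rows.append([counts[word] for word in vocabulary])
    return rows


# Toy corpus for illustration only; not taken from the Reuters dataset.
corpus = [
    "oil prices rise as oil demand grows",
    "gold prices fall",
    "oil exports rise",
]

vocabulary = top_words(corpus, 3)
matrix = count_matrix(corpus, vocabulary)
print(vocabulary)  # → ['oil', 'prices', 'rise']
print(matrix)      # → [[2, 1, 1], [0, 1, 0], [1, 0, 1]]
```

Note that this sketch stores one row per document, the transpose of the layout on the slide.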

Dimensionality reduction - Reuters top 100 words

Dimensionality reduction - pipeline
Documents = text + category
● Text → words (100 dimensions) → t-SNE (dimensionality reduction) → reduced 2d vectors → visualization
● Categories → visualization
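
A sketch of the reduction step, assuming scikit-learn and NumPy are installed. The slides do not say which t-SNE implementation was used, so `sklearn.manifold.TSNE` is an assumption, and the random counts below are a stand-in for the real word-count matrix.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in data: 60 "documents" with 100 word counts each (random, illustration only).
word_counts = rng.poisson(lam=1.0, size=(60, 100)).astype(float)
categories = rng.integers(0, 3, size=60)  # hypothetical category id per document

# Reduce 100 dimensions to 2; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="random", random_state=0)
points_2d = tsne.fit_transform(word_counts)

print(points_2d.shape)  # → (60, 2)

# For the visualization step one could then plot the points colored by category,
# for example with matplotlib:
#   plt.scatter(points_2d[:, 0], points_2d[:, 1], c=categories)
```

The categories are not used by t-SNE itself; they only color the points afterwards, which is what makes the plot a check on the quality of the features.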

Dimensionality reduction - MNIST

Dimensionality reduction - pipeline
MNIST = picture + class
● Picture → pixels (784 dimensions, 28 × 28) → t-SNE (dimensionality reduction) → reduced 2d vectors → visualization
● Classes → visualization

Data cleaning
● Remove stop words:
○ a
○ the
○ or
● Stemming:
● Remove non-alphanumeric characters:
○ $%^@#
○ 😁😂
○ <html> https://
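
The cleaning steps above can be sketched as follows. The stop-word set contains only the three examples from the slide, and `naive_stem` is a crude, hypothetical stand-in for a real stemmer (such as a Porter stemmer); the slides do not say which stemmer was used.

```python
import re

# Only the three examples from the slide; real stop-word lists are much longer.
STOP_WORDS = {"a", "the", "or"}


def naive_stem(word):
    """Strip a few common English suffixes (illustration only, not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Return the cleaned, stemmed tokens of a raw text."""
    text = re.sub(r"<[^>]+>", " ", text)               # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)          # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # drop non-alphanumeric characters
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return [naive_stem(token) for token in tokens]


raw = "<html>The ships arrived 😁 or received $%^@# gold https://example.com</html>"
cleaned = clean(raw)
print(cleaned)  # → ['ship', 'arriv', 'receiv', 'gold']
```

Stems such as "arriv" are not dictionary words, which is fine: the goal is only that "arrived", "arrives" and "arriving" end up as the same feature.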

Top 100 words - Data cleaning disabled

Top 100 words - Data cleaning enabled

Data cleaning results
Classification score, data cleaning off: 0.88
Classification score, data cleaning on: 0.90

Classification score - pipeline
words+categories
● 80% training → trained model
● 20% verification → trained model → score
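
A sketch of this split-train-score loop using only the standard library. Two things are assumptions, since the slides do not specify them: the classifier (a nearest-centroid classifier is swapped in here for simplicity) and the score (taken to be accuracy, the fraction of correctly classified verification samples). The data is synthetic.

```python
import random


def split(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into a training and a verification part."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_centroids(train):
    """'Train' by averaging the feature vectors of each category."""
    sums, counts = {}, {}
    for vector, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, vector):
    """Return the category whose centroid is closest to the vector."""
    def squared_distance(centroid):
        return sum((c - v) ** 2 for c, v in zip(centroid, vector))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, samples):
    """Fraction of samples whose predicted category equals the true category."""
    correct = sum(1 for vector, label in samples if predict(centroids, vector) == label)
    return correct / len(samples)


# Synthetic, well-separated word counts: 10 samples per class.
samples = [([5 + i % 2, 0], "A") for i in range(10)]
samples += [([0, 5 + i % 2], "B") for i in range(10)]

train, verification = split(samples)
model = train_centroids(train)
score = accuracy(model, verification)
print(len(train), len(verification), score)  # → 16 4 1.0
```

The score must be computed on the verification part only; scoring on the training data would overstate how well the model generalizes.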

Embedding - doc2vec
● Input: Word 1, Word 2, Word 3, …, Word 50000 and Document 1, Document 2, …, Document 10000
● Output: Word 1, Word 2, Word 3, …, Word 50000

Embedding - doc2vec - example
● Input: Word 1, Word 345, Word 1000, Document 245 → Output: Word 25, Word 1204
● Input: Word 1, Word 345, Word 1000, Document 312 → Output: Word 45, Word 1182

Embedding - doc2vec - example
● Input: Word1, Word2, Word3, Doc1, Doc2
● Hidden
● Output: Word1, Word2, Word3, Word4, Word5

Embedding - pipeline
Text (10000+ dimensions) → doc2vec → document features (100 dimensions) → classification → categories

Reuters - score doc2vec vs top 100 words
Word count top 100 words: 0.90
Doc2vec: 0.94

IMDB movie reviews - doc2vec vs wordcount
Class: positive
Bromwell High is nothing short of brilliant. Expertly
scripted and perfectly delivered, this searing parody of
a students and teachers at a South London Public
School leaves you literally rolling with laughter. It's
vulgar, provocative, witty and sharp. The characters
are a superbly caricatured cross section of British
society (or to be more accurate, of any society).
Following the escapades of Keisha, Latrina and
Natella, our three "protagonists" for want of a better
term, the show doesn't shy away from parodying every
imaginable subject. Political correctness flies out the
window in every episode. If you enjoy shows that
aren't afraid to poke fun of every taboo subject
imaginable, then Bromwell High will not disappoint!
Class: negative
Robert DeNiro plays the most unbelievably intelligent
illiterate of all time. This movie is so wasteful of talent,
it is truly disgusting. The script is unbelievable. The
dialog is unbelievable. Jane Fonda's character is a
caricature of herself, and not a funny one. The movie
moves at a snail's pace, is photographed in an
ill-advised manner, and is insufferably preachy. It also
plugs in every cliche in the book. Swoozie Kurtz is
excellent in a supporting role, but so what?<br /><br
/>Equally annoying is this new IMDB rule of requiring
ten lines for every review. When a movie is this
worthless, it doesn't require ten lines of text to let other
readers know that it is a waste of time and tape. Avoid
this movie.

Word count top 250 words: 0.72
Doc2vec: 0.83

Conclusion
● It’s all about extracting the right features from your data
● Visualize the data to get a sense of the value of your features
● You can use the same algorithms for text, image, audio and other kinds of
data once it is converted to an abstract feature space

Hands-on
● Tweak the pipeline
● Doc2vec similarity
● Tweak the classification algorithm
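
For the doc2vec similarity exercise, one common way to compare two document vectors is cosine similarity. The slides do not state which measure was used, so this is an assumption; the short vectors below are made up and stand in for real 100-dimensional document embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical document vectors.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a
doc_c = [0.0, 0.0, 3.0]   # orthogonal to doc_a

print(round(cosine_similarity(doc_a, doc_b), 6))  # → 1.0
print(round(cosine_similarity(doc_a, doc_c), 6))  # → 0.0
```

A value near 1 means the two documents point in the same direction in the embedding space, regardless of vector length.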
