Text classification
Knowledge session
Agenda
● Text classification
● Sparse data
○ Dimensionality reduction / visualization sparse data
○ Classification on sparse data
● Text embedding
○ Short explanation doc2vec
○ Visualization sparse vs embedded
○ Classification sparse vs embedded
● Hands-on!
Text classification - Definition
● Text classification is the task of assigning predefined categories to free-text documents.
Example: News article classification
What is the category of this news article?
Classification: Sunken ships
Example: News article classification
Examples: Great war
Examples: Sunken ships
Example: Every word is a feature
Feature dimensions:
              Document 1  Document 2  Document 3  Document 4
              (Class A)   (Class A)   (Class B)   (Class B)
1: arrived    0           1           4           5
2: received   1           2           3           5
3: gold       4           4           4           1
4: a          1           0           1           2
5: energy     5           5           5           3
Each column is one document's feature vector in the feature space.
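A minimal sketch of how a word-count feature vector like the ones above could be built. The vocabulary is taken from the slide; the helper name `count_vector`, the whitespace tokenization, and the example sentence are assumptions made for illustration.

```python
from collections import Counter


def count_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Vocabulary from the slide; the document text is invented for illustration.
vocabulary = ["arrived", "received", "gold", "a", "energy"]
document = "gold arrived and a gold shipment received energy"

vector = count_vector(document, vocabulary)
print(vector)  # [1, 1, 2, 1, 1]
```

Words outside the vocabulary ("and", "shipment") are simply ignored, so every document ends up as a vector of the same length.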
Dimensionality
Features
(one word per feature)
Classes
Text = high dimensional
Feature dimensions:
              Document 1  Document 2  Document 3  Document 4
              (Class A)   (Class A)   (Class B)   (Class B)
1: arrived    0           1           4           5
2: received   1           2           3           5
3: gold       4           4           4           1
4: a          1           0           1           2
5: energy     5           5           5           3
Text = sparse
Feature dimensions:
              Document 1  Document 2  Document 3  Document 4  Document 5
              (Class A)   (Class A)   (Class A)   (Class A)   (Class ?)
1: acquired   0           0           1           0           0
2: received   0           2           0           0           0
3: collected  1           0           0           0           0
4: a          0           0           0           2           0
5: energy     0           0           0           0           1
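To make the sparsity concrete, the sketch below measures the fraction of zeros in the table above and checks how much Document 5 overlaps with the labeled documents. The counts come from the slide; the helper names `to_sparse` and `dot` and the dictionary layout are assumptions for illustration.

```python
words = ["acquired", "received", "collected", "a", "energy"]

# One count vector per document (the columns of the slide's table).
dense = {
    "Document 1": [0, 0, 1, 0, 0],
    "Document 2": [0, 2, 0, 0, 0],
    "Document 3": [1, 0, 0, 0, 0],
    "Document 4": [0, 0, 0, 2, 0],
    "Document 5": [0, 0, 0, 0, 1],
}


def to_sparse(vector):
    """Keep only the non-zero counts, keyed by word."""
    return {words[i]: count for i, count in enumerate(vector) if count}


def dot(u, v):
    """Dot product: zero when two documents share no words."""
    return sum(a * b for a, b in zip(u, v))


zeros = sum(vector.count(0) for vector in dense.values())
sparsity = zeros / (len(dense) * len(words))
print(sparsity)  # 0.8

# Document 5 shares no word with any labeled document.
overlaps = [dot(dense["Document 5"], dense[name])
            for name in dense if name != "Document 5"]
print(overlaps)  # [0, 0, 0, 0]
```

With zero overlap, a word-count classifier has nothing to go on for Document 5, even though "acquired", "received" and "collected" are close in meaning.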
Dataset: Reuters news article dataset
● Take the top 100 words across the whole corpus
● For each document, count how often each word occurs
            Document 1  Document 2
Word 1      0           2
Word 2      3           1
Word 3      4           4
…
Word 100    1           1
● Result: a feature space of 100 dimensions containing 21578 data points
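The two steps (pick the top words across the corpus, then count them per document) can be sketched as below. This assumes "top" means most frequent; the three-document corpus, the function names, and the use of n = 3 instead of 100 are invented to keep the example small.

```python
from collections import Counter


def top_words(corpus, n):
    """Return the n most frequent words across all documents."""
    totals = Counter()
    for text in corpus:
        totals.update(text.lower().split())
    return [word for word, _ in totals.most_common(n)]


def count_matrix(corpus, vocabulary):
    """One row per document, one column per vocabulary word."""
    rows = []
    for text in corpus:
        counts = Counter(text.lower().split())
        rows.append([counts[word] for word in vocabulary])
    return rows


# Toy corpus, invented for illustration (not Reuters data).
corpus = [
    "oil prices rose as oil demand grew",
    "gold prices fell",
    "oil exports rose",
]

vocabulary = top_words(corpus, 3)
matrix = count_matrix(corpus, vocabulary)
print(vocabulary)  # ['oil', 'prices', 'rose']
print(matrix)      # [[2, 1, 1], [0, 1, 0], [1, 0, 1]]
```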
Dimensionality reduction - Reuters top 100 words
Dimensionality reduction - Reuters top 100 words
Dimensionality reduction - pipeline
Documents (text + category) → words (100 dimensions) → t-SNE (dimensionality reduction) → reduced 2D vectors + categories → visualization
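A hedged sketch of the reduction step in this pipeline, assuming scikit-learn's `TSNE` is used (the slides do not name a library). The input matrix is random stand-in data with 100 columns, not the Reuters counts, and the `perplexity` value is an arbitrary choice for this small sample.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for the word-count matrix: 60 documents x 100 word features.
counts = rng.poisson(lam=0.3, size=(60, 100)).astype(float)

# Reduce 100 dimensions to 2; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=2, perplexity=10, init="random", random_state=0)
reduced = tsne.fit_transform(counts)

print(reduced.shape)  # (60, 2)
```

Each row of `reduced` is one document; plotting the rows as points colored by category gives the kind of visualization the pipeline ends with.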
Dimensionality reduction - MNIST
Dimensionality reduction - pipeline
MNIST (picture + class) → pixels (784 dimensions, 28 × 28) → t-SNE (dimensionality reduction) → reduced 2D vectors + classes → visualization
Data cleaning
● Remove stop words:
○ a
○ the
○ or
● Stemming:
● Remove non-alphanumeric characters:
○ $%^@#
○ 😁😂
○ <html> https://
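The cleaning steps above can be sketched as follows. The stop-word set holds only the three examples from the slide, and `crude_stem` is a hypothetical, deliberately naive suffix stripper standing in for a real stemmer; the slides do not say which stemmer was used.

```python
import re

STOP_WORDS = {"a", "the", "or"}  # the slide's examples; real lists are longer


def crude_stem(word):
    """Hypothetical rough stemmer: strip one common English suffix."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Return cleaned, stemmed tokens for one document."""
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # drop symbols and emoji
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


tokens = clean("<html>The ships arrived 😁 or sailed: https://example.com $%^@#")
print(tokens)  # ['ship', 'arriv', 'sail']
```

Stems such as "arriv" are not real words; what matters is that "arrived", "arrives" and "arriving" would map to the same feature.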
Top 100 words - Data cleaning disabled
Top 100 words - Data cleaning enabled
Data cleaning results
Classification score, data cleaning off: 0.88
Classification score, data cleaning on: 0.90
Classification score - pipeline
Documents (text + category) → words + categories → split: 80% training, 20% verification → trained model → score
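A minimal sketch of the split, train and score steps. The slides do not say which classifier was used, so a nearest-centroid classifier is assumed here purely for illustration; the toy data and all function names are invented, and the score is plain accuracy on the held-out 20%.

```python
import random


def split(samples, train_fraction=0.8, seed=0):
    """Shuffle, then split into training and verification sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_centroids(train):
    """Average the feature vectors of each class."""
    sums, counts = {}, {}
    for vector, label in train:
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def predict(centroids, vector):
    """Return the class whose centroid is closest to the vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def score(centroids, test):
    """Fraction of verification samples classified correctly."""
    correct = sum(predict(centroids, v) == label for v, label in test)
    return correct / len(test)


# Toy, well-separated data: class A is high on feature 0, class B on feature 1.
samples = ([([5 + i % 3, i % 2], "A") for i in range(10)]
           + [([i % 2, 5 + i % 3], "B") for i in range(10)])

train, test = split(samples)
model = train_centroids(train)
accuracy = score(model, test)
print(len(train), len(test), accuracy)  # 16 4 1.0
```

The perfect score only reflects how cleanly the toy classes are separated; on real text the verification score is what tells you whether the features are useful.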
Embedding - doc2vec
Network diagram:
Input: Word 1, Word 2, Word 3, …, Word 50000 and Document 1, Document 2, …, Document 10000
Output: Word 1, Word 2, Word 3, …, Word 50000
Embedding - doc2vec - example
Input: Word 1, Word 345, Word 1000 + Document 245 → Output: Word 25, Word 1204
Input: Word 1, Word 345, Word 1000 + Document 312 → Output: Word 45, Word 1182
Embedding - doc2vec - example
Network diagram: Input (Word1, Word2, Word3, Doc1, Doc2) → Hidden → Output (Word1, Word2, Word3, Word4, Word5)
Embedding - pipeline
Documents (text + category) → text (10000+ dimensions) → doc2vec → document features (100 dimensions) + categories → classification
Reuters - score doc2vec vs top 100 words
Word count, top 100 words: 0.90
Doc2vec: 0.94
IMDB movie reviews - doc2vec vs wordcount
Class: positive
Bromwell High is nothing short of brilliant. Expertly
scripted and perfectly delivered, this searing parody of
a students and teachers at a South London Public
School leaves you literally rolling with laughter. It's
vulgar, provocative, witty and sharp. The characters
are a superbly caricatured cross section of British
society (or to be more accurate, of any society).
Following the escapades of Keisha, Latrina and
Natella, our three "protagonists" for want of a better
term, the show doesn't shy away from parodying every
imaginable subject. Political correctness flies out the
window in every episode. If you enjoy shows that
aren't afraid to poke fun of every taboo subject
imaginable, then Bromwell High will not disappoint!
Class: negative
Robert DeNiro plays the most unbelievably intelligent
illiterate of all time. This movie is so wasteful of talent,
it is truly disgusting. The script is unbelievable. The
dialog is unbelievable. Jane Fonda's character is a
caricature of herself, and not a funny one. The movie
moves at a snail's pace, is photographed in an
ill-advised manner, and is insufferably preachy. It also
plugs in every cliche in the book. Swoozie Kurtz is
excellent in a supporting role, but so what?<br /><br
/>Equally annoying is this new IMDB rule of requiring
ten lines for every review. When a movie is this
worthless, it doesn't require ten lines of text to let other
readers know that it is a waste of time and tape. Avoid
this movie.
IMDB movie reviews - doc2vec vs wordcount
IMDB movie reviews - doc2vec vs wordcount
Word count, top 250 words: 0.72
Doc2vec: 0.83
Conclusion
● It’s all about extracting the right features from your data
● Visualize the data to get a sense of the value of your features
● You can use the same algorithms for text, image, audio and other kinds of
data once it is converted to an abstract feature space
Hands-on
● Tweak the pipeline
● Doc2vec similarity
● Tweak the classification algorithm
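For the similarity exercise, one possible starting point is cosine similarity between document vectors. This is an assumption about what "similarity" means here; the three-dimensional vectors and the names below are invented stand-ins for real 100-dimensional doc2vec vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, vectors):
    """Return the name of the stored vector closest to the query."""
    return max(vectors, key=lambda name: cosine_similarity(query, vectors[name]))


# Invented stand-ins for document embeddings.
doc_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.9, 0.1, 0.0],
}

best = most_similar([1.0, 0.2, 0.0], doc_vectors)
print(best)  # doc_c
```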
