Text classification
Knowledge session
Agenda
● Text classification
● Sparse data
○ Dimensionality reduction / visualization sparse data
○ Classification on sparse data
● Text embedding
○ Short explanation doc2vec
○ Visualization sparse vs embedded
○ Classification sparse vs embedded
● Hands-on!
Text classification - Definition
● Text classification is the task of assigning predefined categories to free-text documents.
Example: News article classification
What is the category of this news article?
Classification: Sunken ships
Example: News article classification
Examples: Great war
Examples: Sunken ships
Example: Every word is a feature
Feature dimensions   Document 1   Document 2   Document 3   Document 4
                     (Class A)    (Class A)    (Class B)    (Class B)
1: arrived           0            1            4            5
2: received          1            2            3            5
3: gold              4            4            4            1
4: a                 1            0            1            2
5: energy            5            5            5            3
Feature vector Feature space
Dimensionality
Features (one word per feature)
Classes
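As a minimal sketch of the table above, each document column can be read as one feature vector, i.e. a point in a five-dimensional feature space. The names `COUNTS`, `vectors` and `euclidean` are illustrative assumptions and do not come from the slides.

```python
import math

# Word counts copied from the slide: one row per word (feature),
# one column per document.
FEATURES = ["arrived", "received", "gold", "a", "energy"]
COUNTS = [
    [0, 1, 4, 5],  # arrived
    [1, 2, 3, 5],  # received
    [4, 4, 4, 1],  # gold
    [1, 0, 1, 2],  # a
    [5, 5, 5, 3],  # energy
]
LABELS = ["A", "A", "B", "B"]

# Transpose rows into columns: each document becomes one feature vector.
vectors = [list(column) for column in zip(*COUNTS)]


def euclidean(u, v):
    """Distance between two documents in feature space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


for number, (vector, label) in enumerate(zip(vectors, LABELS), start=1):
    print(f"Document {number} (class {label}): {vector}")
print("distance(doc 1, doc 2) =", round(euclidean(vectors[0], vectors[1]), 3))
```

Once documents are points in a feature space, any distance-based or geometric algorithm can operate on them without knowing they started as text.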
Text = high dimensional
Feature dimensions   Document 1   Document 2   Document 3   Document 4
                     (Class A)    (Class A)    (Class B)    (Class B)
1: arrived           0            1            4            5
2: received          1            2            3            5
3: gold              4            4            4            1
4: a                 1            0            1            2
5: energy            5            5            5            3
Text = sparse
Feature dimensions   Document 1   Document 2   Document 3   Document 4   Document 5
                     (Class A)    (Class A)    (Class A)    (Class A)    (Class ?)
1: acquired          0            0            1            0            0
2: received          0            2            0            0            0
3: collected         1            0            0            0            0
4: a                 0            0            0            2            0
5: energy            0            0            0            0            1
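The sparsity in the table above can be made concrete with a short sketch. It stores only the non-zero counts per document and measures how much of the matrix is empty. The helper names `to_sparse` and `shared_words` are assumptions made for illustration.

```python
# Word counts from the slide: rows are words, columns are documents 1 to 5.
WORDS = ["acquired", "received", "collected", "a", "energy"]
DENSE = [
    [0, 0, 1, 0, 0],  # acquired
    [0, 2, 0, 0, 0],  # received
    [1, 0, 0, 0, 0],  # collected
    [0, 0, 0, 2, 0],  # a
    [0, 0, 0, 0, 1],  # energy
]


def to_sparse(column):
    """Keep only the non-zero counts of one document as {word: count}."""
    return {WORDS[i]: count for i, count in enumerate(column) if count != 0}


documents = [to_sparse(column) for column in zip(*DENSE)]

# Fraction of cells in the matrix that are zero.
total_cells = len(DENSE) * len(DENSE[0])
nonzero_cells = sum(len(doc) for doc in documents)
sparsity = 1 - nonzero_cells / total_cells


def shared_words(a, b):
    """Words that occur in both documents."""
    return set(a) & set(b)


print(documents)
print("sparsity:", sparsity)
print("doc 5 vs doc 1:", shared_words(documents[4], documents[0]))
```

Document 5 shares no word with any labelled document, so raw word counts give a classifier nothing to compare, even though "acquired", "received" and "collected" are close in meaning.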
Dataset: Reuters news article dataset
● 100 top words across the whole corpus
● For each document, count how often each word occurs
           Document 1   Document 2
Word 1     0            2
Word 2     3            1
Word 3     4            4
Word 100   1            1
Feature space of 100 dimensions containing 21578 data points
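The two steps described for the Reuters data (pick the top words across the corpus, then count them per document) can be sketched as follows. The three-sentence corpus and the function names are hypothetical; the slides use the 100 top words over 21578 documents.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (no cleaning yet)."""
    return text.lower().split()


def top_words(corpus, n):
    """Return the n most frequent words across the whole corpus."""
    totals = Counter()
    for text in corpus:
        totals.update(tokenize(text))
    return [word for word, _ in totals.most_common(n)]


def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Hypothetical toy corpus standing in for the Reuters articles.
corpus = [
    "gold prices rose as gold demand for gold grew",
    "oil prices fell",
    "gold and oil prices",
]
vocabulary = top_words(corpus, 3)  # the slides use n=100
vectors = [count_vector(text, vocabulary) for text in corpus]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every document ends up with a vector of the same length, whatever its original length, which is what makes the result a fixed-size feature space.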
Dimensionality reduction - Reuters top 100 words
Dimensionality reduction - pipeline
Documents = text + category
Words (100 dimensions) → t-SNE (dimensionality reduction) → reduced 2D vectors
Reduced 2D vectors + categories → visualization
Dimensionality reduction - MNIST
Dimensionality reduction - pipeline
MNIST = picture + class
Pixels (784 dimensions) → t-SNE (dimensionality reduction) → reduced 2D vectors
Reduced 2D vectors + classes → visualization
Data cleaning
● Remove stop words:
○ a
○ the
○ or
● Stemming
● Remove non-alphanumeric characters:
○ $%^@#
○ 😁😂
○ <html> https://
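The cleaning steps listed above can be sketched in a few lines. This is only an illustration: the stop word list contains just the three examples from the slide, and `crude_stem` is a deliberately rough stand-in for a real stemmer (for instance a Porter stemmer), which the slides do not specify.

```python
import re

# The three examples from the slide; real stop word lists are much longer.
STOP_WORDS = {"a", "the", "or"}


def crude_stem(word):
    """Very rough suffix stripping; an assumption, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Return cleaned, stemmed tokens for one document."""
    text = re.sub(r"<[^>]+>", " ", text)              # drop HTML tags
    text = re.sub(r"https?://\S*", " ", text)         # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # drop symbols and emoji
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return [crude_stem(token) for token in tokens]


print(clean("<html>The ships arrived 😁 $%^@# https://example.com or a sunken ship"))
```

After cleaning, "ships" and "ship" map to the same feature, so fewer of the top-word slots are spent on stop words and spelling variants.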
Top 100 words - Data cleaning disabled
Top 100 words - Data cleaning enabled
Data cleaning results
Classification score with data cleaning off: 0.88
Classification score with data cleaning on: 0.90
Classification score - pipeline
Documents = text + category
Words + categories are split 80% / 20% for training and verification
Training → trained model
Trained model + verification → score
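The scoring pipeline can be sketched as below, assuming the 80% share is used for training and the 20% share for verification. The slides do not name the classifier, so a simple nearest-centroid model is used here purely as a placeholder, and the toy samples are hypothetical.

```python
import random


def split(samples, train_fraction=0.8, seed=0):
    """Shuffle and split samples into a training and a verification set."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_centroids(training):
    """Average the feature vectors of each category."""
    sums, counts = {}, {}
    for vector, category in training:
        acc = sums.setdefault(category, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[category] = counts.get(category, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def predict(model, vector):
    """Return the category whose centroid is closest to the vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(model, key=lambda category: squared_distance(model[category]))


def score(model, verification):
    """Fraction of verification samples classified correctly."""
    correct = sum(predict(model, v) == c for v, c in verification)
    return correct / len(verification)


# Hypothetical toy data: (word-count vector, category).
samples = [
    ([5, 0], "A"), ([4, 1], "A"), ([6, 1], "A"), ([5, 2], "A"), ([4, 0], "A"),
    ([0, 5], "B"), ([1, 4], "B"), ([1, 6], "B"), ([2, 5], "B"), ([0, 4], "B"),
]
training, verification = split(samples)
model = train_centroids(training)
print(len(training), len(verification), score(model, verification))
```

Scoring on documents the model never saw during training is what makes numbers such as 0.88 and 0.90 comparable between pipeline variants.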
Embedding - doc2vec
Input: Word 1, Word 2, Word 3, …, Word 50000 and Document 1, Document 2, …, Document 10000
Output: Word 1, Word 2, Word 3, …, Word 50000
Embedding - doc2vec - example
Input: Word 1, Word 345, Word 1000, Document 245 → Output: Word 25, Word 1204
Input: Word 1, Word 345, Word 1000, Document 312 → Output: Word 45, Word 1182
Embedding - doc2vec - example
Input: Word1, Word2, Word3, Doc1, Doc2
Hidden
Output: Word1, Word2, Word3, Word4, Word5
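The input/output layout on these slides resembles the distributed-memory variant of doc2vec, in which context words plus a document identifier are used to predict a word. Under that assumption, the sketch below only prepares the training examples such a network would see; it does not train the network itself. The function name, the window size and the two toy documents are hypothetical.

```python
def training_examples(documents, window=2):
    """Yield (context_words, doc_id, target_word) triples.

    The context words and the document id would feed the input layer;
    the target word is what the output layer is trained to predict.
    """
    for doc_id, text in enumerate(documents):
        tokens = text.lower().split()
        for position, target in enumerate(tokens):
            left = tokens[max(0, position - window):position]
            right = tokens[position + 1:position + 1 + window]
            yield left + right, doc_id, target


# Hypothetical toy documents.
documents = ["sunken ships carried gold", "great war ended"]
examples = list(training_examples(documents, window=1))

for context, doc_id, target in examples:
    print(f"input: {context} + Doc{doc_id}  ->  output: {target}")
```

Because the document id is part of every input, the weights attached to it must absorb whatever the context words alone cannot predict; those weights become the document's embedding.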
Embedding - pipeline
Documents = text + category
Text (10000+ dimensions) → doc2vec → document features (100 dimensions)
Document features + categories → classification
Reuters - score doc2vec vs top 100 words
Word count (top 100 words): 0.90
Doc2vec: 0.94
IMDB movie reviews - doc2vec vs wordcount
Class: positive
Bromwell High is nothing short of brilliant. Expertly
scripted and perfectly delivered, this searing parody of
a students and teachers at a South London Public
School leaves you literally rolling with laughter. It's
vulgar, provocative, witty and sharp. The characters
are a superbly caricatured cross section of British
society (or to be more accurate, of any society).
Following the escapades of Keisha, Latrina and
Natella, our three "protagonists" for want of a better
term, the show doesn't shy away from parodying every
imaginable subject. Political correctness flies out the
window in every episode. If you enjoy shows that
aren't afraid to poke fun of every taboo subject
imaginable, then Bromwell High will not disappoint!
Class: negative
Robert DeNiro plays the most unbelievably intelligent
illiterate of all time. This movie is so wasteful of talent,
it is truly disgusting. The script is unbelievable. The
dialog is unbelievable. Jane Fonda's character is a
caricature of herself, and not a funny one. The movie
moves at a snail's pace, is photographed in an
ill-advised manner, and is insufferably preachy. It also
plugs in every cliche in the book. Swoozie Kurtz is
excellent in a supporting role, but so what?<br /><br
/>Equally annoying is this new IMDB rule of requiring
ten lines for every review. When a movie is this
worthless, it doesn't require ten lines of text to let other
readers know that it is a waste of time and tape. Avoid
this movie.
IMDB movie reviews - doc2vec vs wordcount
Word count (top 250 words): 0.72
Doc2vec: 0.83
Conclusion
● It’s all about extracting the right features from your data
● Visualize the data to get a sense of the value of your features
● You can use the same algorithms for text, image, audio and other kinds of
data once it is converted to an abstract feature space
Hands-on
● Tweak the pipeline
● Doc2vec similarity
● Tweak the classification algorithm
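For the doc2vec similarity exercise, one common way to compare two document embeddings is cosine similarity. The sketch below assumes the embeddings are already available as plain lists of floats; the three-dimensional vectors and their labels are made up for illustration (the slides use 100 dimensions).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, vectors):
    """Return the key of the stored vector most similar to the query."""
    return max(vectors, key=lambda key: cosine_similarity(query, vectors[key]))


# Hypothetical document embeddings.
doc_vectors = {
    "ships": [0.9, 0.1, 0.0],
    "war": [0.1, 0.8, 0.3],
    "gold": [0.7, 0.0, 0.4],
}
query = [1.0, 0.2, 0.1]
print(most_similar(query, doc_vectors))
```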
