Text classification
Outline
• Purpose of the research
• Introduction
• Applications of text classification
• Approaches and methods in text classification
• Summary

Purpose of the research
Review the state of the art for the text classification problem.

Introduction
• Text classification (TC) is one of the important fields in natural language processing and a core text mining task.
• Text classification assigns one or more classes to a document according to its content.

Applications of text classification
o CRM tasks
o Social media
o E-mail spam filtering
o Sentiment analysis
o Commercial applications
o Question answering systems and dialogue agents
o Others

Approaches and methods in text classification

Methods in text classification
• Rule-based (or rule classification): uses rules to classify text.
• Statistical: uses machine learning and deep learning.

Machine learning for text classification
Use a bag-of-words (BOW) representation to extract features from text for use in machine learning algorithms.

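A bag-of-words vector simply counts how often each vocabulary word occurs in a document and discards word order. The sketch below is a minimal illustration under stated assumptions: whitespace tokenization, a vocabulary built from the example documents themselves, and hypothetical helper names (`tokenize`, `build_vocabulary`, `bag_of_words`) that are not from the slides.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split; real systems usually also
    # handle punctuation, stemming and stop words.
    return text.lower().split()

def build_vocabulary(documents):
    # Map each distinct token to a column index, in sorted order.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def bag_of_words(text, vocabulary):
    # Count occurrences of each vocabulary word; word order is discarded.
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector

docs = ["free prize claim your free prize", "meeting agenda for monday"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
```

Each document becomes a fixed-length count vector that any of the classifiers discussed in this deck could consume; words outside the vocabulary are ignored.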
Machine learning algorithms for text classification
• Decision Trees
• Support Vector Machine
• Naïve Bayes
• K-Nearest Neighbors
• Hidden Markov Model

Decision Trees
• A decision tree is a tree whose internal nodes are tests and whose leaf nodes are categories.
• Their ability to learn disjunctive expressions and their robustness to noisy data make them convenient for document classification.
• Decision tree learning cannot guarantee returning the globally optimal decision tree.
• High induction (training) cost.

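One way to see why tree learning is not globally optimal: induction is usually greedy, picking the single best test at each node. The sketch below assumes documents are sets of words, each test is "does the document contain this word?", and tests are ranked by information gain. The function names and the toy data are illustrative and not taken from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(docs, labels, word):
    # Entropy reduction achieved by the test "document contains `word`".
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without = [y for d, y in zip(docs, labels) if word not in d]
    if not with_word or not without:
        return 0.0  # the test does not split the data at all
    n = len(labels)
    remainder = ((len(with_word) / n) * entropy(with_word)
                 + (len(without) / n) * entropy(without))
    return entropy(labels) - remainder

def best_test(docs, labels):
    # Greedy choice for one node: the word with the highest information gain.
    words = sorted(set().union(*docs))
    return max(words, key=lambda w: information_gain(docs, labels, w))

docs = [{"free", "prize"}, {"free", "offer"},
        {"meeting", "monday"}, {"free", "meeting"}]
labels = ["spam", "spam", "ham", "ham"]
root = best_test(docs, labels)
```

On this toy data the word "meeting" separates the two classes perfectly, so it is chosen as the root test. A full learner would recurse on each branch, which is where the locally best choice can miss the globally best tree.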
Decision Trees
▷ Harrag, El-Qawasmeh & Pichappan: used decision trees for Arabic text classification. They suggested a hybrid technique of document frequency thresholding that uses an embedded information gain criterion as the preferable feature selection criterion.
▷ Vateekul & Kubat: worked on imbalanced, large-scale, multi-label data and tried to reduce induction costs with FDT ("fast decision-tree induction").
▷ Johnson, Oles, Zhang & Goetz (2002): combined FDT with a modern method for converting a decision tree to a rule set.

K-Nearest Neighbors
▷ Applied to text categorization since the early 1990s; a strong baseline in benchmark evaluations.
▷ Among the top-performing methods in TC evaluations; scalable to large TC applications.
▷ Also called:
○ Case-based learning
○ Memory-based learning
○ Lazy learning

K-Nearest Neighbors
▷ Using only the closest example to determine the category is subject to errors due to:
○ A single atypical example.
○ Noise (i.e. error) in the category label of a single training example.
▷ A more robust alternative is to find the k most similar examples and return the majority category of these k examples.
▷ The value of k is typically odd to avoid ties; 3 and 5 are most common.
▷ No feature selection is necessary.

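The effect of a noisy label on 1-NN versus majority voting can be sketched as follows. The example assumes word-count vectors compared by cosine similarity (the slides do not name a similarity measure), and the training set, including one deliberately mislabeled message, is invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors (Counters).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(query, training, k=3):
    # training: list of (text, label) pairs; returns the majority label
    # among the k training texts most similar to the query.
    q = Counter(query.lower().split())
    scored = sorted(
        ((cosine(q, Counter(text.lower().split())), label)
         for text, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

training = [
    ("win a free prize now", "spam"),
    ("free prize waiting claim now", "spam"),
    ("project meeting on monday", "ham"),
    ("agenda for the monday meeting", "ham"),
    ("claim your free prize today", "ham"),  # deliberately mislabeled (noise)
]

nearest_only = knn_classify("claim your free prize", training, k=1)
majority = knn_classify("claim your free prize", training, k=3)
```

With k=1 the query inherits the wrong label of its single closest neighbor, while k=3 lets the two correctly labeled neighbors outvote it.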
KNN
• Hierarchical KNN (high performance with small and large datasets), in two steps:
○ Step 1: select a high K.
○ Step 2: select neighbor features.
• KNN with documents indexed by n-grams (unigrams and bigrams).
• KNN with K-means for grouping into clusters, then weighted

Naïve Bayes
• Simple, common and very fast.
• Naïve Bayes is not so naïve: a good, dependable baseline for text classification (but not the best).
• Very good in domains with many equally important features.
• Popular for document categorization.
• Conditional independence assumption:
○ Features are independent of each other given the class.
• Needs a very large number of training examples.

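The conditional independence assumption means the score of a class is its prior times a product of per-word probabilities. The sketch below is a minimal multinomial variant working in log space with add-one (Laplace) smoothing; both choices, the function names and the toy messages are assumptions for illustration rather than details from the slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    # examples: list of (text, label). Collect the counts the model needs.
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(text, model):
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # log P(class) + sum of log P(word | class): independence assumption.
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            score += math.log((word_counts[label][tok] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb([
    ("free prize now", "spam"),
    ("claim free prize", "spam"),
    ("meeting on monday", "ham"),
    ("monday project agenda", "ham"),
])
```

Training is a single counting pass over the data, which is why the method is so fast.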
Naïve Bayes
• Singhal & Sharma: eliminating features leads to improved performance.
• Posterior estimation with dependencies between features, and reduction of the feature dimensions.
• Use NB without the feature independence assumption and split related features (high performance as the dataset grows).

Hidden Markov Model
• An HMM is a sequential model of text.
• It is a simple process for generating a sequence of words:
○ generate states y1, ..., yn
○ generate words w1, ..., wn from Pr(W | Y = yi)
• Classification is not simple.

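The two generation steps above imply that the joint probability of a state sequence and a word sequence factors into start, transition and emission terms. The sketch below assumes a first-order HMM with explicit start and transition tables; the slides only mention the emission term Pr(W | Y = yi), so the tables, state names and numbers are hypothetical.

```python
def joint_probability(states, words, start, transition, emission):
    # Pr(y1..yn, w1..wn) = Pr(y1) * Pr(w1 | y1)
    #                      * product over i > 1 of Pr(yi | yi-1) * Pr(wi | yi)
    prob = start[states[0]] * emission[states[0]].get(words[0], 0.0)
    for prev, cur, word in zip(states, states[1:], words[1:]):
        prob *= transition[prev][cur] * emission[cur].get(word, 0.0)
    return prob

# Hypothetical two-state model of an e-mail: a greeting followed by a body.
start = {"GREETING": 0.5, "BODY": 0.5}
transition = {
    "GREETING": {"GREETING": 0.25, "BODY": 0.75},
    "BODY": {"GREETING": 0.0, "BODY": 1.0},
}
emission = {
    "GREETING": {"hello": 0.5, "dear": 0.5},
    "BODY": {"invoice": 0.5, "attached": 0.5},
}

p = joint_probability(
    ["GREETING", "BODY", "BODY"],
    ["hello", "invoice", "attached"],
    start, transition, emission,
)
```

Generation is this easy because the states are given. Classification is harder because the states are hidden, so the probability of a word sequence has to account for every possible state path.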
HMM
• Frasconi, Soda & Vullo: represent documents as sequences of pages (high performance with large documents).
• Use a "minimum message length" estimator to find the optimal number of states for higher performance.

Support Vector Machine
• Proposed by Vapnik; provides "a maximal margin separating hyperplane" between two classes of data and has non-linear extensions.
• Represents each text document as a vector.
• A popular supervised learning model used for binary classification.
▷ Why SVM?
○ High-dimensional input space
○ Few irrelevant features
○ Sparse document vectors

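A linear SVM on sparse document vectors can be sketched with a few lines of subgradient descent on the L2-regularized hinge loss. This is an illustrative simplification, not the solver any particular library uses: it has no bias term, uses a fixed learning rate, and the hyperparameters and toy documents are assumptions.

```python
def dot(w, x):
    # Sparse dot product between a weight dict and a feature dict.
    return sum(w.get(f, 0.0) * v for f, v in x.items())

def train_linear_svm(data, lam=0.01, lr=0.1, epochs=50):
    # data: list of (features, y) with y in {+1, -1}.
    # Minimizes lam/2 * ||w||^2 + hinge loss with per-example subgradient steps.
    w = {}
    for _ in range(epochs):
        for x, y in data:
            violated = y * dot(w, x) < 1  # inside the margin or misclassified
            for f in w:
                w[f] *= (1 - lr * lam)  # shrink step from the L2 penalty
            if violated:
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + lr * y * v  # hinge subgradient step
    return w

def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1

# +1 = spam, -1 = ham; each document is a sparse bag-of-words dict.
data = [
    ({"free": 1, "prize": 1}, 1),
    ({"meeting": 1, "monday": 1}, -1),
    ({"free": 1}, 1),
    ({"meeting": 1}, -1),
]
w = train_linear_svm(data)
```

Only words that actually occur in a document are touched by the hinge update, which is one reason linear SVMs cope well with high-dimensional, sparse text vectors.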
SVM
• Yao & Fan: use a weighted kernel function that depends on features of the training data, for interference detection.
• Rennie & Rifkin: applied SVM to the task of multiclass text classification.
• Joseph, Yun and Yanqing (2015): use a Word2Vec representation with SVM for semantic features.

Deep learning for text classification
No manual feature extraction is needed.

Deep learning
• Around 2010, DL started outperforming other ML techniques,
• first in speech and vision, then in NLP.
• Several big improvements in NLP in recent years.
• Leverages different levels of representation:
○ words and characters
○ syntax and semantics

Deep learning: why?
o Manually designed features are often over-specified, incomplete, and take a long time to design and validate.
o Learned features are easy to adapt and fast to learn.
o Can learn in both supervised and unsupervised settings.
o Deep learning provides a very flexible, (almost?) universal, learnable framework for representing world, visual and linguistic information.

Convolutional NN
• Convolutional neural networks (CNNs, 2014).
• Main CNN idea for text: compute vectors for n-grams and group them afterwards.
• Use a single one-dimensional convolution layer followed by a max-pooling layer that combines neighboring vectors.
• The goal is to learn a region-based text embedding.
• Fast to train and powerful for text classification.
• Learning an optimal kernel size is challenging.

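The "compute vectors for n-grams, then group them" idea can be written out without any deep learning library. The sketch below assumes each filter spans `width` consecutive word vectors, applies a ReLU, and pools by taking the maximum over all positions (max-over-time pooling); the embeddings and filter weights are made-up numbers, whereas a real model would learn them.

```python
def conv1d_max_pool(embeddings, filters, width):
    # embeddings: one vector (list of floats) per token.
    # filters: each filter is a flat list of width * dim weights.
    # Returns one max-pooled feature per filter.
    windows = [
        # Concatenate the vectors of one n-gram into a single flat window.
        [value for vec in embeddings[i:i + width] for value in vec]
        for i in range(len(embeddings) - width + 1)
    ]
    if not windows:
        raise ValueError("text is shorter than the filter width")
    features = []
    for filt in filters:
        # ReLU of the filter response at each n-gram position.
        responses = [max(0.0, sum(w * v for w, v in zip(filt, win)))
                     for win in windows]
        features.append(max(responses))  # keep the strongest response
    return features

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # 4 tokens, dim 2
filters = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 1.0]]         # 2 bigram filters
features = conv1d_max_pool(embeddings, filters, width=2)
```

The output length equals the number of filters regardless of sentence length, so it can feed a fixed-size classification layer. The `width` argument is the kernel size whose choice the last bullet calls challenging.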
Recurrent NN
• Recurrent NNs have received much attention because of their superior ability to preserve sequence information over time.
• Tai et al. (2015) generalized LSTM to Tree-LSTM, where each LSTM unit gains information from its child units.
• Can remember long sequences; LSTM units have forget gates.
• High cost (O(n^2)).

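How a forget gate lets an LSTM keep or discard remembered information can be shown with a single cell update. The sketch below uses a scalar input and a scalar hidden state purely to stay short (an assumption; real LSTMs use vectors and weight matrices), and the parameter layout `p` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    # p maps each gate name to (input weight, recurrent weight, bias).
    def gate(name, squash):
        w_x, w_h, b = p[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new content to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # proposed new content
    c = f * c_prev + i * g             # updated cell state (the "memory")
    h = o * math.tanh(c)               # updated hidden state
    return h, c

zeros = {name: (0.0, 0.0, 0.0)
         for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zeros)

# A large forget-gate bias keeps the old cell state almost unchanged.
keep = dict(zeros, forget=(0.0, 0.0, 100.0))
h_keep, c_keep = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=keep)
```

With all-zero parameters every sigmoid gate is 0.5, so half of the previous cell state survives; pushing the forget gate towards 1 preserves it, which is the mechanism behind remembering long sequences.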
Bidirectional LSTM
▷ It involves duplicating the first recurrent layer in the network, so that one copy reads the input sequence as-is and the other reads a reversed copy.
▷ Remarkable performance on sentences, more than on documents.

Recurrent Convolutional NN (2015)
▷ Captures contextual information by maintaining a state of all previous inputs.
▷ Remarkable performance in document classification.

AC-BLSTM
▷ Asymmetric Convolutional Bidirectional LSTM (AC-BLSTM, 2017).
▷ Remarkable performance in sentence and document classification tasks.

Hierarchical Attention Networks
▷ HAN (2016).
▷ Assume that a document has L sentences s_i and each sentence contains T_i words.
▷ It consists of several parts:
○ a word sequence encoder
○ a word-level attention layer
○ a sentence encoder
○ a sentence-level attention layer

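The two attention layers listed above share one operation: score each vector against a context vector, normalize the scores with a softmax, and take the weighted sum. The sketch below shows only that pooling step applied at both levels. It deliberately omits the word and sentence encoders and the learned projection that the full model applies before scoring, and the context vectors here are fixed inputs rather than learned parameters.

```python
import math

def attention_pool(vectors, context):
    # Score each vector against the context, softmax the scores, and return
    # the attention weights together with the weighted sum of the vectors.
    scores = [sum(v * c for v, c in zip(vec, context)) for vec in vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, vectors))
              for d in range(dim)]
    return weights, pooled

def encode_document(sentences, word_context, sentence_context):
    # sentences: list of sentences, each a list of word vectors.
    # Word-level attention builds sentence vectors; sentence-level attention
    # then combines them into a single document vector.
    sentence_vectors = [attention_pool(words, word_context)[1]
                        for words in sentences]
    return attention_pool(sentence_vectors, sentence_context)[1]

words = [[1.0, 0.0], [0.0, 1.0]]
uniform_weights, uniform_pooled = attention_pool(words, [0.0, 0.0])
focused_weights, _ = attention_pool(words, [10.0, 0.0])
doc_vector = encode_document([words, [[2.0, 2.0]]], [0.0, 0.0], [0.0, 0.0])
```

A zero context vector gives every item the same weight, while a context aligned with the first word concentrates almost all of the weight on it.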
Rule classification

Rule base
▷ Based on linguistic rules that capture the elements and attributes of a document in order to assign it to a category.
▷ A rule-based approach is flexible, powerful and easy to express.
▷ Requires an understanding of the text (meaning, relevancy, relationships between concepts, etc.).
▷ Provides a true representation of the language.
▷ Supports writing simpler rules with a higher level of abstraction.
▷ Makes it easier to improve accuracy over time.
▷ But it is not suitable for very large rule sets.
▷ An old method, but still in use.

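A rule-based classifier can be as small as an ordered list of patterns. The sketch below assumes keyword rules written as regular expressions with a first-match-wins policy; the categories and keywords are invented for illustration, and real rule bases typically encode much richer linguistic conditions.

```python
import re

# Each rule is (category, compiled pattern); the first matching rule wins.
RULES = [
    ("spam", re.compile(r"\b(free|winner|prize)\b", re.IGNORECASE)),
    ("billing", re.compile(r"\b(invoice|refund|payment)\b", re.IGNORECASE)),
    ("support", re.compile(r"\b(error|crash|not working)\b", re.IGNORECASE)),
]

def classify(text, default="other"):
    # Try the rules in order and fall back to a default category.
    for category, pattern in RULES:
        if pattern.search(text):
            return category
    return default
```

Improving accuracy means editing or adding a rule, which is easy to reason about for a handful of rules but becomes hard to maintain as the rule set grows.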
Thanks!
Any questions?

Seminar dm

  • 2. Outline ▷ Purpose of search ▷ Introduction ▷ Applications of text classification ▷ Approaches and methods in text classification ▷ Summary
  • 3. Purpose of search: the state of the art for the text classification problem.
  • 5. Introduction ▷ Text classification (TC) is one of the important fields in natural language processing. ▷ Text classification assigns one or more classes to a document according to its content.
  • 6. Applications of text classification ▷ CRM tasks ▷ Social media ▷ E-mail spam filtering ▷ Sentiment analysis ▷ Commercial world ▷ Question answering systems and dialogue agents ▷ Other
  • 7. Approaches and methods in text classification
  • 8. Methods in text classification ▷ Rule-based (or rule classification): uses rules to classify text. ▷ Statistical: uses machine learning and deep learning.
  • 9. Machine learning for text classification: uses bag-of-words (BOW) as a way of extracting features from text for use in ML algorithms.
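The bag-of-words idea on slide 9 can be illustrated with a minimal sketch. It assumes whitespace tokenization and lowercasing, which the slides do not specify; the function names and the two toy documents are hypothetical.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def bow_vector(document, vocabulary):
    """Return term counts for one document; out-of-vocabulary tokens are ignored."""
    counts = Counter(document.lower().split())
    # Dicts preserve insertion order, so columns follow the sorted vocabulary.
    return [counts.get(tok, 0) for tok in vocabulary]

# Toy documents (illustrative only).
docs = ["free prize free money", "meeting agenda for monday"]
vocab = build_vocabulary(docs)
vectors = [bow_vector(d, vocab) for d in docs]
```

Each document becomes a fixed-length count vector that ignores word order, which is the form the classical algorithms in the following slides expect as input.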
  • 11. Machine learning algorithms for text classification ▷ Decision trees ▷ Support vector machine ▷ Naïve Bayes ▷ K-nearest neighbors ▷ Hidden Markov model
  • 12. Decision Trees ▷ A decision tree is a tree whose internal nodes are tests and whose leaf nodes are categories. ▷ Their ability to learn disjunctive expressions and their robustness to noisy data make them convenient for document classification. ▷ Learning a decision tree is not guaranteed to return the globally optimal tree. ▷ High cost.
  • 13. Decision Trees ▷ Harrag, El-Qawasmeh & Pichappan: used decision trees for Arabic text classification. They suggested hybrid techniques of document frequency thresholding, using an embedded information gain criterion as the preferable feature selection criterion. ▷ Vateekul & Kubat: worked on imbalanced, large-scale, multi-label data and tried to reduce these costs with FDT ("fast decision-tree induction"). ▷ Johnson, Oles, Zhang & Goetz (2002): combined FDT with a modern method for converting a decision tree to a rule set.
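The information gain criterion mentioned on slide 13 can be sketched as follows. This is only an illustration of the generic criterion, assuming a binary test on whether a term occurs in a document; it is not the hybrid method of the cited authors, and the toy documents and labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(documents, labels, term):
    """Entropy reduction from splitting on whether `term` occurs in a document."""
    with_term = [y for d, y in zip(documents, labels) if term in d.split()]
    without_term = [y for d, y in zip(documents, labels) if term not in d.split()]
    # Weighted entropy of the two branches; empty branches contribute nothing.
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_term, without_term)
        if part
    )
    return entropy(labels) - remainder

# Toy corpus (illustrative only).
docs = ["win free prize", "free money now", "project meeting today", "meeting notes attached"]
labels = ["spam", "spam", "ham", "ham"]
gain_free = information_gain(docs, labels, "free")  # separates the classes perfectly
gain_now = information_gain(docs, labels, "now")    # much less informative
```

A decision tree learner would place the test with the highest gain at the root and recurse on each branch, which is a greedy procedure and therefore not guaranteed to find the globally optimal tree.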
  • 14. K-Nearest Neighbors ▷ Applied to text categorization since the early 1990s; a strong baseline in benchmark evaluations. ▷ Among the top-performing methods in TC evaluations; scalable to large TC applications. ▷ Also called: ○ Case-based learning ○ Memory-based learning ○ Lazy learning
  • 15. K-Nearest Neighbors ▷ Using only the closest example to determine the categorization is subject to errors due to: ○ A single atypical example. ○ Noise (i.e. error) in the category label of a single training example. ▷ A more robust alternative is to find the k most similar examples and return the majority category of these k examples. ▷ The value of k is typically odd to avoid ties; 3 and 5 are most common. ▷ No feature selection is necessary.
  • 16. KNN ▷ Hierarchical KNN (high performance with small and large datasets), in two steps: ○ Step 1: select a high K. ○ Step 2: select neighbor features. ▷ KNN with documents indexed by n-grams (unigrams and bigrams). ▷ KNN (with K-means) for grouping into clusters, then weighted
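The majority-vote rule from slide 15 can be sketched as below. Cosine similarity is an assumption here, since the slides do not name a similarity measure, and the toy count vectors and labels are hypothetical. One training example carries a deliberately wrong label to show why k = 1 is fragile.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_classify(query, examples, k=3):
    """Return the majority label among the k examples most similar to `query`.

    `examples` is a list of (vector, label) pairs.
    """
    ranked = sorted(examples, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set (illustrative only); the first example has a noisy label.
training = [
    ([1, 1, 0, 0], "politics"),  # mislabeled: looks like a sports document
    ([2, 1, 0, 0], "sports"),
    ([1, 2, 0, 0], "sports"),
    ([0, 0, 2, 1], "politics"),
    ([0, 0, 1, 2], "politics"),
]
query = [1, 1, 0, 0]
nearest_only = knn_classify(query, training, k=1)  # follows the noisy label
majority = knn_classify(query, training, k=3)      # outvotes the noisy label
```

With k = 1 the single mislabeled neighbor decides the outcome, while k = 3 lets the two correctly labeled neighbors outvote it.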
  • 17. Naïve Bayes ▷ Simple, common, and very fast. ▷ Baseline: Naïve Bayes is not so naïve; it is a good, dependable baseline for text classification (but not the best). ▷ Very good in domains with many equally important features. ▷ Popular for document categorization. ▷ Conditional independence assumption: features are independent of each other given the class. ▷ Needs a very large number of training examples.
  • 18. Naïve Bayes ▷ Singhal & Sharma: eliminating features leads to improved performance. ▷ Posterior estimation with dependency between features, and reduction of feature dimensionality. ▷ Using NB without the feature independence assumption and splitting related features (high performance as the dataset grows).
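The conditional independence assumption from slide 17 means a document's score for a class is the class prior times the product of per-word likelihoods. A minimal sketch follows, assuming the multinomial variant with add-one (Laplace) smoothing, neither of which the slides specify; the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary

def predict(document, model):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in document.split():
            if word in vocabulary:  # words never seen in training are skipped
                # Add-one smoothing avoids zero probabilities for unseen (word, class) pairs.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        scores[label] = score
    return max(scores, key=scores.get)

# Toy corpus (illustrative only).
docs = ["free prize money", "free offer now", "meeting agenda today", "project meeting notes"]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(docs, labels)
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long documents.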
  • 19. Hidden Markov model ▷ An HMM is a sequential model of text. ▷ A simple process for generating a sequence of words. ▷ Classification is not simple. ▷ Generate states y1, ..., yn. ▷ Generate words w1, ..., wn from Pr(W | Y = yi).
  • 20. HMM ▷ Frasconi, Soda & Vullo: represent documents as a series of pages (high performance with large documents). ▷ Use a "minimum message length estimator" to find the optimal number of states for higher performance.
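The generative process on slide 19 (first states, then one word per state) implies a joint probability for a state sequence and a word sequence. The sketch below computes it in log space. The two state names and all probability tables are made-up illustrative values, not taken from the slides or the cited work.

```python
import math

def joint_log_probability(states, words, start, transition, emission):
    """log Pr(y1..yn, w1..wn) = log Pr(y1) + sum log Pr(y_t | y_{t-1}) + sum log Pr(w_t | y_t)."""
    log_p = math.log(start[states[0]]) + math.log(emission[states[0]][words[0]])
    for prev, cur, word in zip(states, states[1:], words[1:]):
        log_p += math.log(transition[prev][cur]) + math.log(emission[cur][word])
    return log_p

# Hypothetical two-state model (illustrative values only).
start = {"TOPIC": 0.6, "FILLER": 0.4}
transition = {
    "TOPIC": {"TOPIC": 0.7, "FILLER": 0.3},
    "FILLER": {"TOPIC": 0.5, "FILLER": 0.5},
}
emission = {
    "TOPIC": {"goal": 0.5, "match": 0.4, "the": 0.1},
    "FILLER": {"goal": 0.1, "match": 0.1, "the": 0.8},
}
log_p = joint_log_probability(["FILLER", "TOPIC"], ["the", "goal"], start, transition, emission)
```

Classification is harder than generation because the states are hidden: scoring a document under a class model requires summing or maximizing over all state sequences, typically with the forward or Viterbi algorithm.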
  • 21. Support Vector Machine ▷ Proposed by Vapnik; provides "a maximal margin separating hyperplane" between two classes of data and has non-linear extensions. ▷ Represents the text document as a vector. ▷ A popular supervised learning model used for binary classification. ▷ Why SVM? ○ High-dimensional input space ○ Few irrelevant features ○ Sparse document vectors
  • 22. SVM ▷ Yao & Fan: use a weighted kernel function that depends on features of the training data, for interference detection. ▷ Rennie & Rifkin: applied to the task of classifying multilayered text. ▷ Joseph, Yun and Yanqing (2015): use a Word2Vec representation with SVM for semantic features.
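To make the "maximal margin separating hyperplane" from slide 21 concrete, the sketch below compares the geometric margin of two candidate hyperplanes on a toy dataset. It does not train an SVM; it only evaluates the quantity an SVM maximizes. The weight vectors, biases, and data points are hypothetical.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(w, b, examples):
    """Smallest signed distance from any example to the hyperplane.

    Positive only if every example lies on the correct side.
    `examples` is a list of (vector, label) pairs with labels +1 or -1.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm for x, y in examples)

# Toy linearly separable data (illustrative only).
examples = [([2.0, 0.0], +1), ([3.0, 1.0], +1), ([0.0, 2.0], -1), ([1.0, 3.0], -1)]
margin_a = geometric_margin([1.0, -1.0], 0.0, examples)   # diagonal separator
margin_b = geometric_margin([1.0, 0.0], -1.5, examples)   # vertical separator
```

Both hyperplanes classify every point correctly, but the first leaves a wider gap between the classes, so a maximum-margin learner would prefer it.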
  • 23. Deep learning for text classification: no feature extraction
  • 24. Deep learning ▷ Around 2010, DL started outperforming other ML techniques, first in speech and vision, then in NLP. ▷ Several big improvements in NLP in recent years. ▷ Leverages different levels of representation: ○ words & characters ○ syntax & semantics
  • 25. Deep learning: why? ▷ Manually designed features are often over-specified and incomplete, and take a long time to design and validate. ▷ Learned features are easy to adapt and fast to learn. ▷ Can learn both supervised and unsupervised. ▷ Deep learning provides a very flexible, (almost?) universal, learnable framework for representing world, visual and linguistic information.
  • 26. Convolutional NN ▷ Convolutional neural networks (CNNs, 2014). ▷ Main CNN idea for text: compute vectors for n-grams and group them afterwards. ▷ Use a single 1-dimensional convolution layer followed by a max-pooling layer that combines neighboring vectors. ▷ The goal is to learn a region-based text embedding. ▷ Fast to train and powerful in text classification. ▷ Learning an optimal kernel size is challenging.
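The "one convolution layer plus max pooling" idea on slide 26 can be sketched without any framework. The example assumes a single filter of width 2 (one feature per bigram window), a ReLU activation, and max-over-time pooling; the word vectors and filter weights are hand-picked toy values rather than learned ones.

```python
def conv1d_max_pool(word_vectors, kernel, bias=0.0):
    """Slide one filter over consecutive word vectors and keep the strongest response.

    `kernel` has one weight row per position in the window, so len(kernel) is the
    n-gram width. Assumes the sentence has at least len(kernel) words.
    """
    width = len(kernel)
    activations = []
    for start in range(len(word_vectors) - width + 1):
        window = word_vectors[start:start + width]
        value = bias + sum(
            k * x
            for kernel_row, vector in zip(kernel, window)
            for k, x in zip(kernel_row, vector)
        )
        activations.append(max(0.0, value))  # ReLU
    return max(activations)  # max-over-time pooling

# Four words with 2-dimensional embeddings (illustrative values only).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
bigram_filter = [[1.0, -1.0], [0.5, 0.5]]
feature = conv1d_max_pool(sentence, bigram_filter)
```

A real model would apply many such filters, possibly with several widths, and feed the pooled features to a classifier layer. The filter width corresponds to the kernel size that the slide notes is hard to choose.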
  • 28. Recurrent NN ▷ Recurrent NNs have received much attention because of their superior ability to preserve sequence information over time. ▷ Tai et al. (2015) generalized LSTM to Tree-LSTM, where each LSTM unit gains information from its children units. ▷ Able to remember long sequences; has forget gates. ▷ High cost (O(n²)).
  • 29. Bidirectional LSTM ▷ It involves duplicating the first recurrent layer in the network. ▷ Remarkable performance on sentences, more so than on documents.
  • 30. Recurrent Convolutional NN (2015) ▷ Captures contextual information by maintaining a state of all previous inputs. ▷ Remarkable performance in document classification.
  • 31. AC-BLSTM ▷ Asymmetric Convolutional Bidirectional LSTM (AC-BLSTM, 2017). ▷ Remarkable performance in sentence and document classification tasks.
  • 33. Hierarchical Attention Networks ▷ HAN (2016). ▷ Assume that a document has L sentences s_i and each sentence contains T_i words. ▷ It consists of several parts: ○ a word sequence encoder ○ a word-level attention layer ○ a sentence encoder ○ a sentence-level attention layer
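The attention layers listed on slide 33 both follow the same pattern: score each encoded item against a context vector, normalize the scores with softmax, and take the weighted sum. The sketch below shows that pattern in simplified form. It deliberately omits the learned projection that HAN applies to each hidden state before scoring, and the hidden states and context vector are toy values rather than learned ones.

```python
import math

def attention_pool(hidden_states, context):
    """Return (attention weights, weighted sum of hidden states)."""
    # Dot-product relevance of each hidden state to the context vector.
    scores = [sum(h * c for h, c in zip(state, context)) for state in hidden_states]
    # Numerically stable softmax.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [
        sum(w * state[i] for w, state in zip(weights, hidden_states))
        for i in range(dim)
    ]
    return weights, pooled

# Three encoded words of one sentence (illustrative values only).
word_states = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
context = [2.0, 0.0]
weights, sentence_vector = attention_pool(word_states, context)
```

In the hierarchical setting the same function would be applied twice: once over the words of each sentence to get sentence vectors, and once over the encoded sentence vectors to get a document vector.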
  • 35. Rule-based ▷ Based on linguistic rules that capture all of the elements and attributes of a document to assign it to a category. ▷ A rule-based approach is flexible, powerful and easy to express. ▷ Requires understanding of the text (meaning, relevancy, relationships between concepts, etc.). ▷ Provides a true representation of the language. ▷ Supports writing simpler rules with a higher level of abstraction. ▷ Makes it easier to improve accuracy over time. ▷ But not suited to very large rule sets. ▷ An old method, but still used.
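A rule-based classifier of the kind described on slide 35 can be as simple as an ordered list of keyword rules. The sketch below is a deliberately minimal illustration: the categories, keywords, and first-match policy are hypothetical, and real linguistic rules would be far richer than keyword lookup.

```python
# Ordered rules: the first rule whose keyword set overlaps the text wins.
RULES = [
    ("billing", {"invoice", "refund", "payment"}),
    ("support", {"error", "crash", "bug"}),
]

def classify_by_rules(text, rules=RULES, default="other"):
    """Return the category of the first matching rule, or `default` if none match."""
    tokens = set(text.lower().split())
    for category, keywords in rules:
        if tokens & keywords:
            return category
    return default
```

For example, `classify_by_rules("Please send the invoice again")` returns `"billing"`. Because every rule is written and ordered by hand, maintenance effort grows quickly with the number of rules, which is the limitation the slide points out.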