Text classification
Outline
 Purpose of the survey
 Introduction
 Applications of text classification
 Approaches and methods in text classification
 Summary
Purpose of the survey
To review the state of the art for the text classification problem.
Introduction
 Text classification (TC) is a text mining task and one of the important fields in natural language processing.
 Text classification assigns one or more classes to a document according to its content.
Applications of text classification
o CRM tasks
o Social media
o E-mail spam filtering
o Sentiment analysis
o Commercial world
o Question answering systems and dialogue agents
o Others
Approaches and methods in text classification

Methods in text classification
o Rule-based (also called rule classification): uses handcrafted rules to classify text.
o Statistical: uses machine learning and deep learning.
Machine learning for text classification
Uses a bag-of-words (BOW) representation to extract features from text for use in machine learning algorithms.
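A minimal sketch of the BOW idea, building a document-term count matrix with scikit-learn; the library choice and the two toy documents are assumptions, not taken from the slides.

```python
# Minimal bag-of-words sketch (scikit-learn and the toy documents are illustrative).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cheap pills buy now",        # spam-like toy text
    "meeting agenda for monday",  # ham-like toy text
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(X.toarray())                          # each row is the BOW vector of one document
```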
Machine learning algorithms for text classification
 Decision Trees.
 Support Vector Machines.
 Naïve Bayes.
 K-Nearest Neighbors.
 Hidden Markov Models.
Decision Trees
 A decision tree is a tree whose internal nodes are tests and whose leaf nodes are categories.
 Their ability to learn disjunctive expressions and their robustness to noisy data make them convenient for document classification.
 Decision-tree learning cannot guarantee returning the globally optimal tree.
 High cost.
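A minimal sketch of a decision-tree text classifier over BOW features; scikit-learn and the tiny spam/ham dataset are illustrative assumptions, not part of the original material.

```python
# Toy decision-tree text classifier over bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

train_texts = ["win a free prize now", "free cash offer",
               "project status update", "schedule the team meeting"]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit(train_texts, train_labels)          # greedy, locally optimal splits

print(model.predict(["free prize meeting"]))  # classify an unseen toy document
```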
Decision Trees
▷ Harrag, El-Qawasmeh & Pichappan used a decision tree for Arabic text classification. They suggested a hybrid technique that applies a document frequency threshold together with an embedded information gain criterion as the preferred feature selection criterion.
▷ Vateekul & Kubat worked on imbalanced, large-scale, and multi-label data and tried to reduce these costs with FDT ("fast decision-tree induction").
▷ Johnson, Oles, Zhang & Goetz (2002) combined an FDT with a modern method for converting a decision tree to a rule set.
K-Nearest Neighbors
▷ Applied to text categorization in the early 1990s; a strong baseline in benchmark evaluations.
▷ Among the top-performing methods in TC evaluations; scalable to large TC applications.
▷ Also called:
○ Case-based learning
○ Memory-based learning
○ Lazy learning
K-Nearest Neighbors
▷ Using only the single closest example to determine the category is subject to errors due to:
○ A single atypical example.
○ Noise (i.e. an error) in the category label of a single training example.
▷ A more robust alternative is to find the k most similar examples and return the majority category among these k examples.
▷ The value of k is typically odd to avoid ties; 3 and 5 are most common.
▷ No feature selection is necessary.
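A minimal k-NN sketch with k = 3 and TF-IDF features; scikit-learn and the toy sentiment corpus are assumptions made for illustration only.

```python
# k-NN text classification with an odd k (k = 3) to avoid ties.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_texts = ["great movie, loved it", "wonderful acting", "terrible plot",
               "boring and slow", "excellent soundtrack", "awful ending"]
train_labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
knn.fit(train_texts, train_labels)        # "lazy" learning: just stores the examples

print(knn.predict(["loved the acting"]))  # majority vote of the 3 nearest neighbors
```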
KNN
 Hierarchical KNN (high performance on both small and large datasets), with two steps:
 Step 1: select a high k.
 Step 2: select neighbor features.
 KNN with documents indexed by n-grams (unigrams and bigrams).
 KNN combined with k-means to group documents into clusters, followed by weighted voting.
Naïve Bayes
 Simple, common, and very fast.
 A baseline method.
 Naïve Bayes is not so naïve: a good, dependable baseline for text classification (but not the best)!
 Very good in domains with many equally important features.
 Popular for document categorization.
 Conditional independence assumption: features are independent of each other given the class.
 Needs a very large number of training examples.
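A minimal Naïve Bayes sketch using word counts (the multinomial variant commonly used for text); scikit-learn and the toy spam/ham data are assumptions for illustration.

```python
# Multinomial Naive Bayes over word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["buy cheap meds", "limited offer click here",
               "lunch tomorrow?", "minutes of the last meeting"]
train_labels = ["spam", "spam", "ham", "ham"]

nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(train_texts, train_labels)         # estimates P(class) and P(word | class)

print(nb.predict(["cheap offer"]))        # class with the highest posterior
print(nb.predict_proba(["cheap offer"]))  # posterior under the independence assumption
```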
Naïve Bayes
 Singhal & Sharma: eliminating features leads to improved performance.
 Posterior estimation with dependencies between features, combined with reducing the dimensionality of the feature space.
 Using NB without the feature independence assumption and splitting related features (high performance as the dataset grows).
Hidden Markov Model
 The HMM is a sequential model of text.
 A simple process generates a sequence of words:
 generate states y1, ..., yn;
 generate words w1, ..., wn from Pr(W | Y = yi).
 Classification, however, is not as simple as generation.
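A minimal sketch of the generative process above, assuming NumPy; the states, vocabulary, and transition/emission probabilities are made up for illustration.

```python
# Sample a word sequence from a toy HMM: states first, then a word per state.
import numpy as np

states = ["SPORT", "FINANCE"]
vocab = ["goal", "match", "stock", "market"]
start = np.array([0.5, 0.5])                 # Pr(y1)
trans = np.array([[0.8, 0.2],                # Pr(y_t | y_{t-1})
                  [0.3, 0.7]])
emit = np.array([[0.45, 0.45, 0.05, 0.05],   # Pr(w | y = SPORT)
                 [0.05, 0.05, 0.45, 0.45]])  # Pr(w | y = FINANCE)

rng = np.random.default_rng(0)
y = rng.choice(2, p=start)
words = []
for _ in range(5):
    words.append(vocab[rng.choice(4, p=emit[y])])  # generate w_t from Pr(W | Y = y_t)
    y = rng.choice(2, p=trans[y])                  # generate the next state
print(words)
```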
HMM
 Frasconi, Soda & Vullo: represent documents as a series of pages (high performance with large documents).
 Use a "minimum message length" estimator to choose the optimal number of states for higher performance.
Support Vector Machine
 The SVM was proposed by Vapnik; it provides "a maximal margin separating hyperplane" between two classes of data and has non-linear extensions.
 Represents a text document as a vector.
 A popular supervised learning model used for binary classification.
▷ Why SVM for text?
○ High-dimensional input space
○ Few irrelevant features
○ Sparse document vectors
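A minimal linear-SVM sketch over sparse TF-IDF vectors, matching the high-dimensional, sparse setting listed above; scikit-learn and the toy corpus are assumptions.

```python
# Linear SVM on sparse TF-IDF document vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["the team won the final", "a thrilling last-minute goal",
               "shares fell after the earnings report", "the central bank raised rates"]
train_labels = ["sport", "sport", "finance", "finance"]

svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(train_texts, train_labels)   # fits a maximal-margin separating hyperplane

print(svm.predict(["goal scored in the final minute"]))
```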
SVM
 Yao & Fan: used a weighted kernel function that depends on features of the training data for interference detection.
 Rennie & Rifkin: applied SVMs to the task of multiclass text classification.
 Joseph, Yun and Yanqing (2015): used a Word2Vec representation with SVMs to capture semantic features.
Deep learning for text classification
No manual feature extraction is required.
Deep learning
 Around 2010, DL started outperforming other ML techniques, first in speech and vision, then in NLP.
 Several big improvements in NLP in recent years.
 Leverages different levels of representation:
 words & characters;
 syntax & semantics.
Deep learning – why?
o Manually designed features are often over-specified, incomplete, and take a long time to design and validate.
o Learned features are easy to adapt and fast to learn.
o Deep learning can learn in both supervised and unsupervised settings.
o Deep learning provides a very flexible, (almost?) universal, learnable framework for representing world, visual, and linguistic information.
Convolutional NN
 Convolutional Neural Networks for text classification (CNNs, 2014).
 Main CNN idea for text: compute vectors for n-grams and group them afterwards.
 Use a single 1-dimensional convolution layer followed by a max-pooling layer that combines neighboring vectors.
 The goal is to learn a region-based text embedding.
 Fast to train and powerful for text classification.
 Learning an optimal kernel size is challenging.
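A minimal sketch of such a 1-D CNN text classifier, assuming Keras/TensorFlow; the vocabulary size, sequence length, and kernel size are placeholder choices, not values from the slides.

```python
# 1-D convolution + max pooling over word embeddings for binary text classification.
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN = 20_000, 200            # placeholder sizes

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),          # integer-encoded, padded word ids
    layers.Embedding(VOCAB_SIZE, 128),       # learned word vectors
    layers.Conv1D(128, 5, activation="relu"),# region (n-gram) features, kernel size 5
    layers.GlobalMaxPooling1D(),             # max pooling over the sequence
    layers.Dense(1, activation="sigmoid"),   # binary text classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, ...) on integer-encoded, padded sequences.
```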
Recurrent NN
 Recurrent NNs have received much attention because of their superior ability to preserve sequence information over time.
 Tai et al. (2015) generalized the LSTM to the Tree-LSTM, where each LSTM unit gains information from its children units.
 The LSTM can remember long sequences and has forget gates.
 High cost (O(n²)).
Bidirectional LSTM
▷ It involves duplicating the first recurrent layer in the network so that the input sequence is processed both forwards and backwards.
▷ Remarkable performance on sentences, more so than on documents.
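A minimal Keras sketch of the duplicated (forward and backward) recurrent layer, assuming TensorFlow; all sizes are placeholders.

```python
# Bidirectional LSTM over word embeddings for binary text classification.
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN = 20_000, 200

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 128),
    layers.Bidirectional(layers.LSTM(64)),   # forward + backward pass over the text
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```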
Recurrent Convolutional NN (2015)
▷ Captures contextual information by maintaining a state over all previous inputs.
▷ Remarkable performance in document classification.
AC-BLSTM
▷ Asymmetric Convolutional Bidirectional LSTM (AC-BLSTM, 2017).
▷ Remarkable performance on both sentence and document classification tasks.
Hierarchical Attention Networks
▷ HAN (2016).
▷ Assume a document has L sentences Si and each sentence contains Ti words.
▷ It consists of several parts:
○ a word sequence encoder,
○ a word-level attention layer,
○ a sentence encoder, and
○ a sentence-level attention layer.
Rule-based classification
▷ Based on linguistic rules that capture the elements and attributes of a document in order to assign it to a category.
▷ A rule-based approach is flexible, powerful, and easy to express.
▷ It requires understanding of the text (meaning, relevancy, relationships between concepts, etc.).
▷ Provides a true representation of the language.
▷ Supports writing simpler rules with a higher level of abstraction.
▷ Makes it easier to improve accuracy over time.
▷ But it does not scale well to very large rule sets.
▷ An old method, but still in use.
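A toy rule-based classifier to make the idea concrete; the keyword rules and categories are invented for illustration only.

```python
# Hand-written keyword rules assign a document to the best-matching category.
RULES = {
    "sport":   {"match", "goal", "tournament", "coach"},
    "finance": {"stock", "market", "earnings", "dividend"},
}

def classify(text: str, default: str = "other") -> str:
    words = set(text.lower().split())
    # pick the category whose keyword set overlaps the document the most
    best = max(RULES, key=lambda cat: len(RULES[cat] & words))
    return best if RULES[best] & words else default

print(classify("the coach praised the goal"))           # -> sport
print(classify("quarterly earnings beat the market"))   # -> finance
print(classify("nothing relevant here"))                # -> other
```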
Thanks!
Any questions?