Learning Non-Linear Functions for Text Classification
Cohan Sujay Carlos¹, Geetanjali Rakshit²
¹Aiaioo Labs, Bangalore, India
²IIT-B Monash Research Academy, Mumbai, India
December 20, 2016
Outline
1 Background
2 3-Layer Bayesian Models
Product of Sums
Sum of Products
3 Linearity of “Product of Sums”
4 Non-Linearity of “Sum of Products”
5 Experimental Results
6 Conclusions
XOR
[Figure: the XOR problem. The points (0,0) and (1,1) form one class and (0,1) and (1,0) the other; successive slides overlay candidate linear decision boundaries, none of which separates the two classes.]
Motivation
Multilayer neural networks have been shown to be capable of learning non-linear boundaries.
Questions:
Is there something special about neural networks?
Can probabilistic classifiers also learn non-linear boundaries?
In particular, can multilayer Bayesian models do so, and under what conditions?
Bayesian Models = Directed PGMs
[Figure: a directed graphical model over nodes x1, x2, x3, x4.]
The joint probabilities factorise according to Equation 1:
P(x_1, x_2, \ldots, x_n) = \prod_{1 \le i \le n} P(x_i \mid x_{a(i)}) \quad (1)
where a(i) denotes the ancestors of node i.
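As a concrete illustration of Equation 1, here is a minimal Python sketch; the chain x1 → x2 → x3 and all the CPT numbers are hypothetical, chosen only to show the factorisation:

```python
# Hypothetical CPTs for a chain x1 -> x2 -> x3 over binary variables.
P_x1 = {0: 0.6, 1: 0.4}
P_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
P_x3_given_x2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}

def joint(x1, x2, x3):
    # Equation 1: P(x1, x2, x3) = P(x1) P(x2|x1) P(x3|x2),
    # each factor conditioned only on the node's ancestors.
    return P_x1[x1] * P_x2_given_x1[x1][x2] * P_x3_given_x2[x2][x3]

print(joint(1, 1, 0))  # 0.4 * 0.8 * 0.5 = 0.16
```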
Naive Bayes
[Figure: the naive Bayes model: a class node C with feature leaves F.]
The naive Bayes classifier is a special case that is two layers deep:
P(F, C) = P(F \mid C) \, P(C) = \prod_{f \in F} P(f \mid C) \, P(C) \quad (2)
The root node represents a class label from the set C = \{c_1, c_2, \ldots, c_k\} and the leaf nodes the set of features F = \{f_1, f_2, \ldots, f_i\}.
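A minimal sketch of how Equation 2 is used for classification. It assumes smoothed parameter tables have already been estimated; the names `log_prior` and `log_lik` (a log P(f|C) entry for every token) are hypothetical stand-ins, not part of the original slides:

```python
import math
from collections import Counter

def nb_log_score(tokens, log_prior, log_lik):
    # Log of Equation 2 for one class: log P(C) + sum over tokens of log P(f|C).
    counts = Counter(tokens)
    return log_prior + sum(n * log_lik[f] for f, n in counts.items())

def classify(tokens, classes):
    # classes: {label: (log_prior, log_lik)}; pick the highest-scoring label.
    return max(classes, key=lambda c: nb_log_score(tokens, *classes[c]))
```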
3-Layer Bayesian Models
[Figure: the 3-layer model: a class node C, hidden nodes H, and feature leaves F.]
In a 3-layer Bayesian model, there is also a set of hidden nodes H = \{h_1, h_2, \ldots, h_j\}.
P(F, H, C) = P(F \mid H, C) \, P(H \mid C) \, P(C) \quad (3)
3-Layer Bayesian Models
[Figure: the same model with the class fixed to a value c and the hidden layer to a value h.]
From P(F, h, c) = P(F \mid h, c) \, P(h \mid c) \, P(c) we obtain Equations 4 and 5:
P(F, c) = \sum_h P(F \mid h, c) \, P(h \mid c) \, P(c) \quad (4)
P(F \mid c) = \sum_h P(F \mid h, c) \, P(h \mid c) \quad (5)
3-Layer Bayesian Models
[Figure: a single feature node f1 with hidden nodes h1 and h2 under class c1.]
Starting from a single feature f instead of the full feature set F, we get Equation 8:
P(f, h, c) = P(f \mid h, c) \, P(h \mid c) \, P(c) \quad (6)
P(f, c) = \sum_h P(f \mid h, c) \, P(h \mid c) \, P(c) \quad (7)
P(f \mid c) = \sum_h P(f \mid h, c) \, P(h \mid c) \quad (8)
Product of Sums
You can derive the likelihood as follows.
P(F \mid c) = \prod_f P(f \mid c) \quad (9)
Substituting Equation 8 into Equation 9, we get Equation 10 (restated below as Equation 13):
P(F \mid c) = \prod_f \sum_h P(f \mid h, c) \, P(h \mid c) \quad (10)
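A sketch of Equation 10 in code, for a fixed class c. The parameter dictionaries `p_f_given_hc` and `p_h_given_c` are hypothetical stand-ins for learned tables:

```python
import math

def product_of_sums_loglik(tokens, p_f_given_hc, p_h_given_c):
    # Equation 10/13: P(F|c) = prod over f of [sum over h of P(f|h,c) P(h|c)].
    logp = 0.0
    for f in tokens:
        logp += math.log(sum(p_f_given_hc[h][f] * p_h_given_c[h]
                             for h in p_h_given_c))
    return logp
```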
Sum of Products
You can also derive the likelihood as follows.
P(F \mid h, c) = \prod_f P(f \mid h, c) \quad (11)
Substituting Equation 11 into Equation 5, we get Equation 12 (restated below as Equation 14):
P(F \mid c) = \sum_h \Bigl[ \prod_f P(f \mid h, c) \Bigr] P(h \mid c) \quad (12)
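The corresponding sketch for Equation 12, again for a fixed class c and the same hypothetical parameter tables; the per-component products are kept in log space and combined with a log-sum-exp to avoid underflow on long documents:

```python
import math

def sum_of_products_loglik(tokens, p_f_given_hc, p_h_given_c):
    # Equation 12/14: P(F|c) = sum over h of [prod over f of P(f|h,c)] P(h|c).
    log_terms = []
    for h, ph in p_h_given_c.items():
        log_terms.append(math.log(ph) +
                         sum(math.log(p_f_given_hc[h][f]) for f in tokens))
    m = max(log_terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in log_terms))
```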
Two Likelihood Equations
Product of Sums:
P(F \mid c) = \prod_f \sum_h P(f \mid h, c) \, P(h \mid c) \quad (13)
Sum of Products:
P(F \mid c) = \sum_h \Bigl[ \prod_f P(f \mid h, c) \Bigr] P(h \mid c) \quad (14)
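The difference matters: in the Product of Sums, the inner sum can be precomputed into a single per-feature marginal, collapsing the model to naive Bayes. A small numeric sketch (all numbers hypothetical) makes the collapse explicit:

```python
# Product of Sums equals naive Bayes with the marginal
# P(f|c) = sum_h P(f|h,c) P(h|c), so the hidden layer adds no
# representational power in Equation 13.
p_h = {"h1": 0.4, "h2": 0.6}
p_f = {"h1": {"the": 0.25, "united": 0.75},
       "h2": {"the": 0.50, "united": 0.50}}
marginal = {f: sum(p_f[h][f] * p_h[h] for h in p_h) for f in ("the", "united")}

doc = ["the", "united", "the"]
pos = 1.0
nb = 1.0
for f in doc:
    pos *= sum(p_f[h][f] * p_h[h] for h in p_h)  # Equation 13
    nb *= marginal[f]                            # plain naive Bayes
assert abs(pos - nb) < 1e-12  # identical likelihoods, hence a linear boundary
```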
Proof of Linearity
If any classifier can be shown to be equivalent to a multinomial naive Bayes classifier, then it can learn only linear decision boundaries.
Proof of Linearity
Consider a toy dataset consisting of two sentences, both belonging to the class Politics (P):
The United States and (TUSa)
The United Nations (TUN)
Proof of Linearity
One of the parameters a naive Bayes classifier would learn is P(The \mid Politics) = z = 2/7 (the word "The" occurs twice among the seven tokens of TUSa and TUN).
[Figure: the naive Bayes model: class P generating the documents TUSa and TUN.]
If a directed PGM learns the same likelihoods (products of these parameters), it too can learn only linear decision boundaries.
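As a quick sanity check, the estimate can be reproduced mechanically from the toy corpus:

```python
# Maximum-likelihood estimate of P(The|Politics) from the two sentences.
docs = [["The", "United", "States", "and"],  # TUSa
        ["The", "United", "Nations"]]        # TUN
tokens = [t for d in docs for t in d]
z = tokens.count("The") / len(tokens)
print(z)  # 2/7 ≈ 0.2857
```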
Proof of Linearity - Hard EM
The parameters of a multinomial 3-layer Bayesian classifier, when hard EM assigns TUSa to hidden node h1 and TUN to h2:
[Figure: class P with hidden nodes h1 (generating TUSa) and h2 (generating TUN), each emitting the word "The" (T).]
a = P(T \mid h_1, P) = \frac{1}{4}, \quad b = P(T \mid h_2, P) = \frac{1}{3}, \quad m = P(h_1 \mid P) = \frac{4}{7}, \quad n = P(h_2 \mid P) = \frac{3}{7}
From Equation 13, z = am + bn = \frac{1}{4} \cdot \frac{4}{7} + \frac{1}{3} \cdot \frac{3}{7} = \frac{2}{7}
Proof of Linearity - Hard EM
The parameters when hard EM instead assigns both TUSa and TUN to h2, leaving h1 empty:
a = \frac{0}{0} \text{ (undefined)}, \quad b = \frac{2}{7}, \quad m = \frac{0}{7}, \quad n = \frac{7}{7}
From Equation 13, z = am + bn = \frac{0}{0} \cdot \frac{0}{7} + \frac{2}{7} \cdot \frac{7}{7} = \frac{2}{7}, the empty hidden node contributing nothing since m = 0.
Proof of Linearity - Soft EM
The parameters under soft EM, with responsibilities: TUSa assigned to h1 with weight 4/5 and to h2 with weight 1/5; TUN assigned to h1 with weight 1/5 and to h2 with weight 4/5:
a = \frac{\frac{4}{5} + \frac{1}{5}}{3.8} = \frac{1}{3.8}, \quad b = \frac{\frac{1}{5} + \frac{4}{5}}{3.2} = \frac{1}{3.2}
m = \frac{4}{5} \cdot \frac{4}{7} + \frac{1}{5} \cdot \frac{3}{7} = \frac{3.8}{7}, \quad n = \frac{1}{5} \cdot \frac{4}{7} + \frac{4}{5} \cdot \frac{3}{7} = \frac{3.2}{7}
(Here 3.8 and 3.2 are the expected numbers of tokens assigned to h1 and h2 respectively.)
From Equation 13, z = am + bn = \frac{1}{3.8} \cdot \frac{3.8}{7} + \frac{1}{3.2} \cdot \frac{3.2}{7} = \frac{2}{7}
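All three learned parameterisations recover the same naive Bayes value z = 2/7, as a quick check confirms; the undefined a = 0/0 of the second case is replaced by 0.0, which is safe because its weight m is 0:

```python
cases = [
    ((1/4, 4/7), (1/3, 3/7)),          # hard EM: TUSa -> h1, TUN -> h2
    ((0.0, 0/7), (2/7, 7/7)),          # hard EM: both documents -> h2
    ((1/3.8, 3.8/7), (1/3.2, 3.2/7)),  # soft EM, 4/5 vs 1/5 responsibilities
]
for (a, m), (b, n) in cases:
    z = a * m + b * n                  # Equation 13 for the word "The"
    assert abs(z - 2/7) < 1e-12
print("all parameterisations give z = 2/7")
```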
Shifted and Scaled XOR
[Figure: the XOR configuration shifted and scaled: axis ticks at 1, 3, and 5, with labelled points at (5,5), (1,5), and (5,1); the fourth point sits at the lower-left corner, labelled (0,0) in the extraction.]
Sum of Products - Multinomial
[Figure: image not recoverable from the extraction.]
Sum of Products - Gaussian
[Figure: image not recoverable from the extraction.]
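To see the non-linearity concretely, here is a minimal sketch of a Gaussian sum-of-products classifier on shifted-and-scaled XOR data: one mixture of two axis-aligned Gaussians per class, with scikit-learn's GaussianMixture running the EM. The data generation and all settings are our own assumptions, not the exact setup behind the slides:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic shifted-and-scaled XOR: two Gaussian clusters per class.
rng = np.random.default_rng(0)
centres = {0: [(1, 1), (5, 5)], 1: [(1, 5), (5, 1)]}
X_parts, y_parts = [], []
for label, cluster_centres in centres.items():
    for mu in cluster_centres:
        X_parts.append(rng.normal(mu, 0.5, size=(200, 2)))
        y_parts.append(np.full(200, label))
X, y = np.vstack(X_parts), np.concatenate(y_parts)

# One diagonal-covariance mixture per class:
# P(F|c) = sum_h [prod_f P(f|h,c)] P(h|c)  (Equation 14, Gaussian leaves).
models = {c: GaussianMixture(n_components=2, covariance_type="diag",
                             random_state=0).fit(X[y == c]) for c in (0, 1)}
log_prior = {c: np.log((y == c).mean()) for c in (0, 1)}

def predict(points):
    scores = np.stack([models[c].score_samples(points) + log_prior[c]
                       for c in (0, 1)])
    return scores.argmax(axis=0)

print("training accuracy:", (predict(X) == y).mean())  # near 1.0: non-linear
```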
Accuracy on Reuters & 20 Newsgroups
Classifier R8 R52 20Ng
Naive Bayes 0.955 0.916 0.797
MaxEnt 0.953 0.866 0.793
SVM 0.946 0.864 0.711
Non-linear 3-Layer Bayes 0.964 0.921 0.808
Table: Accuracy on Reuters R8, R52 and 20 Newsgroups datasets.
Accuracy on Movie Reviews
Classifier Movie Reviews
Naive Bayes 0.7964 ± 0.0136
MaxEnt 0.8355 ± 0.0149
SVM 0.7830 ± 0.0128
Non-linear 3-Layer Bayes 0.8438 ± 0.0110
Table: Accuracy on the Large Movie Review dataset.
Accuracy on Movie Reviews
[Figure: accuracy versus training-set size (5,000 to 35,000 reviews; accuracy axis roughly 0.75 to 0.9) for Naive Bayes, Non-Linear Hierarchical Bayes, MaxEnt, and SVM.]
Accuracy on Movie Reviews
[Figure: accuracy versus training-set size (5,000 to 35,000 reviews; accuracy axis roughly 0.8 to 0.9) for the 3-layer model with 1 to 5 hidden nodes per class.]
On the Query Classification Dataset
Classifier Accuracy
Naive Bayes 0.721 ± 0.018
MaxEnt 0.667 ± 0.028
SVM 0.735 ± 0.032
Non-Linear 3-Layer Bayes 0.711 ± 0.032
Table: Accuracy on query classification.
Accuracy on Artificial Dataset
[Figure: image not recoverable from the extraction.]
Accuracy on Artificial Dataset
Shape GNB GHB
ring 0.664 0.949
dots 0.527 0.926
XOR 0.560 0.985
S 0.770 0.973
Table: Accuracy on the artificial dataset (GNB: Gaussian naive Bayes; GHB: Gaussian hierarchical Bayes).
Conclusions
We have shown that generative classifiers with a hidden layer are capable of learning non-linear decision boundaries under the right conditions (independence assumptions), and therefore might be said to be capable of deep learning.
We have also shown experimentally that multinomial non-linear 3-layer Bayesian classifiers can outperform some linear classification algorithms on some text classification tasks.
Errata
In our survey of prior work, we wrote:
"Hierarchical Bayesian models are directed probabilistic graphical models in which the directed acyclic graph is a tree (each node that is not the root node has only one immediate parent). In these models the ancestors a(i) of any node i do not contain any siblings (every pair of ancestors is also in an ancestor-descendant relationship)."
This statement is incorrect: hierarchical Bayesian models, including ours, are in general DAGs, not necessarily trees.
Future Work
Comparison of performance with deep neural networks; evaluation on image-processing tasks; other training algorithms such as Gibbs sampling (MCMC) and variational methods.
Going from 3 layers to higher numbers of layers (solving the "vanishing information" problem).
Integration with larger models, such as sequential models (HMMs) and translation models (alignment models or sequence-to-sequence models).
Probabilistic embeddings.
A unified model covering Bayesian models and neural networks.
Thank you!
Cohan Sujay Carlos & Geetanjali Rakshit
cohan@aiaioo.com