A Study of Deep Learning on Sentence Classification
By: Trinh Hoang-Trieu
Supervisor: Nguyen Le-Minh
Nguyen lab, School of Information Science
Japan Advanced Institute of Science and Technology
thtrieu@apcs.vn
Abstract
The author studies different deep learning models on the task of sentence classification. The dataset used is TREC. The models under investigation are: the Convolutional Neural Network proposed by Yoon Kim, Long Short Term Memory, Long Short Term Memory on top of a convolutional layer, and a convolutional network augmented with max-pooling position. Results show that architectures with a time-retaining mechanism work better than those without one. The author also proposes changes to the conventional final layer, namely replacing the last fully connected layer with a Linear Support Vector Machine, or replacing the discriminative nature of the last layer with a generative one.
1. TREC dataset
The TREC dataset contains 5500 labeled questions in the training set and another 500 in the test set. The dataset has 6 labels and 50 level-2 labels. The average sentence length is 10 words and the vocabulary size is 8700. A sample from TREC is as follows:
DESC:manner How did serfdom develop in and then leave Russia ?
ENTY:cremat What films featured the character Popeye Doyle ?
DESC:manner How can I find a list of celebrities ' real names ?
ENTY:animal What fowl grabs the spotlight after the Chinese Year of the Monkey ?
ABBR:exp What is the full form of .com ?
HUM:ind What contemptible scoundrel stole the cork from my lunch ?
We also experiment on the Vietnamese TREC dataset, which is a direct translation of the TREC dataset and thus has very similar statistics.
2. Word2Vec dataset
In this experiment, we also use another dataset to support training. It consists of pre-trained 300-dimensional embeddings of 3 million words and phrases [1]. These embeddings are used as the input representation for training, and they cover 7500 of the 8700 words in the TREC vocabulary.
For the Vietnamese TREC dataset, we use another set of pre-trained embeddings, which covers 4600 of the approximately 8700 words in the Vietnamese TREC vocabulary.
3. Convolutional Neural Networks for Sentence Classification
Yoon Kim proposes a simple yet effective convolutional architecture that achieves remarkable results on several datasets [2]. The network consists of three layers. The first uses 300 convolutional filters of varying sizes to detect 300 different features across the sentence's length. The second performs a maximum operation to summarise the previous detections. The last is a fully connected layer with softmax output.
Figure 1. Convolutional neural network for sentence classification
The details of this network are as follows:
• Layer 1: three window sizes, 3x300, 4x300, and 5x300, with 100 feature maps each.
• Layer 2: max-pool-over-time.
• Layer 3: fully connected with softmax output; weight norm constrained to 3; dropout with p = 0.5.
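A minimal Keras sketch of this architecture, following the layer details above, is given below. The framework, optimizer, and variable-length input handling are our assumptions; the embedding matrix comes from Section 2.

```python
# Sketch: Kim-style CNN for sentence classification (Section 3).
import tensorflow as tf
from tensorflow.keras import layers, constraints

EMB_DIM, NUM_CLASSES, VOCAB = 300, 6, 8700

inp = layers.Input(shape=(None,), dtype="int32")              # variable-length sentences
emb = layers.Embedding(VOCAB, EMB_DIM, trainable=False)(inp)  # static word2vec input
pools = []
for width in (3, 4, 5):                      # three window sizes, 100 feature maps each
    conv = layers.Conv1D(100, width, activation="relu")(emb)
    pools.append(layers.GlobalMaxPooling1D()(conv))           # max-pool-over-time
feats = layers.Dropout(0.5)(layers.Concatenate()(pools))      # 300 features, dropout p=0.5
out = layers.Dense(NUM_CLASSES, activation="softmax",
                   kernel_constraint=constraints.MaxNorm(3))(feats)  # weight norm <= 3
model = tf.keras.Model(inp, out)
model.compile(optimizer="adadelta", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```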
4. First experiment: Long Short Term Memory
We first experiment with a simple Long Short Term Memory architecture, using the many-to-one scheme to extract features and then discriminating on these extracted features with a fully connected layer.
Figure 2. Many-to-one structure.
Source:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
The details of this network are as follows:
- First layer: word embedding
- Second layer: Long Short Term Memory
- Third layer: fully connected with softmax output
For the second layer, we experimented with three different variations of the recurrent cell:
4.a. Gated Recurrent Unit
Figure 3: Gated Recurrent Unit
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
4.b. Basic Long Short Term Memory
Figure 4: Basic Long Short Term Memory
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
4.c. Long Short Term Memory with peepholes
Figure 5: Long Short Term Memory with peepholes
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
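A minimal sketch of this many-to-one classifier is given below. The hidden size and other hyper-parameters are our assumptions; the GRU (4.a) and basic LSTM (4.b) variants are stock Keras layers, while the peephole variant (4.c) would require a custom cell.

```python
# Sketch: many-to-one recurrent classifier (Section 4) with swappable cells.
import tensorflow as tf
from tensorflow.keras import layers

def make_rnn(kind: str):
    if kind == "gru":                    # 4.a Gated Recurrent Unit
        return layers.GRU(128)
    if kind == "lstm":                   # 4.b basic Long Short Term Memory
        return layers.LSTM(128)
    raise NotImplementedError("4.c peephole LSTM needs a custom cell")

inp = layers.Input(shape=(None,), dtype="int32")        # variable-length sentences
h = layers.Embedding(8700, 300, trainable=False)(inp)   # first layer: word embedding
h = make_rnn("lstm")(h)                                 # second layer: last state only
out = layers.Dense(6, activation="softmax")(h)          # third layer: FC + softmax
rnn_model = tf.keras.Model(inp, out)
```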
5. Long Short Term Memory on top of a Convolutional layer
Instead of max-pool-over-time, we propose a softer version in which the maximum operation is performed over a smaller region of the feature maps. This way, temporal information is better retained: the max-pool layer produces a sequence rather than a single value. The output of this layer is then fed into a Long Short Term Memory cell with the many-to-one scheme, followed by a fully connected layer with softmax output. We again experimented with the three aforementioned types of LSTM cell.
Figure 6: Long Short Term Memory on top of a Convolutional Layer
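A sketch of this hybrid is below. The local pool size and hidden sizes are our assumptions; the report only states that the pooling region is smaller than the whole feature map, so the output remains a sequence.

```python
# Sketch: LSTM on top of a convolutional layer with local max-pooling (Section 5).
import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(None,), dtype="int32")
h = layers.Embedding(8700, 300, trainable=False)(inp)
h = layers.Conv1D(100, 3, activation="relu")(h)   # convolutional feature maps
h = layers.MaxPooling1D(pool_size=3)(h)           # local max-pool: outputs a sequence
h = layers.GRU(128)(h)                            # many-to-one recurrent layer
out = layers.Dense(6, activation="softmax")(h)
cnn_lstm_model = tf.keras.Model(inp, out)
```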
6. Convolutional network augmented with max-pooling position
In this network, besides performing the maximum operation at the max-pooling layer, we also apply the argmax function to produce the position at which the maximum value occurs. By supplying this position information, we expect better performance, since this is also a time-retaining mechanism, only without any recurrent connection.
To propagate gradient information through the argmax operation, we approximate it with a differentiable function:
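One standard differentiable surrogate, and one consistent with the very large constant mentioned in Section 8, is the soft-argmax

$$\operatorname{soft\text{-}argmax}(x) = \sum_{i} i \cdot \frac{e^{\beta x_i}}{\sum_{j} e^{\beta x_j}},$$

where $\beta$ is a large constant: as $\beta \to \infty$, the weights concentrate on the maximal entry and the sum approaches the true argmax.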
Figure 7: Convolutional network with max-pooling position
7. Hyper-parameter tuning and Experimental details
We use 10-fold cross-validation on the training set for hyper-parameter tuning. The word embedding input is kept static in all models except the first, to replicate Yoon Kim's experiment. Since the average sentence length is only 10 while the maximum length in the dataset is approximately 40, we did not pad sentences, to avoid feeding the networks too much noise (otherwise there would be, on average, 30 padding words for every 10 real words in the dataset).
Words that do not appear in the pre-trained embedding dataset are initialised randomly with variance 0.25 to match the existing words. For the training procedure, we hold out 5% of the data for early stopping, since the dataset is not very big. Training is carried out in batches of size 50 to speed up convergence.
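As a sketch of this setup, reusing the `model` object from the Section 3 sketch (the patience value, epoch count, and the pre-built `x_train`/`y_train` arrays are our assumptions):

```python
# Sketch: training with a 5% hold-out for early stopping and batches of 50.
from tensorflow.keras.callbacks import EarlyStopping

model.fit(x_train, y_train,                # assumed pre-built index/label arrays
          batch_size=50,
          epochs=100,
          validation_split=0.05,           # 5% held out for early stopping
          callbacks=[EarlyStopping(patience=5, restore_best_weights=True)])
```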
8. Results
The LSTM cell with peepholes gives the best result when a single recurrent layer is used (Section 4), while the Gated Recurrent Unit gives the best result in the hybrid model of convolutional and recurrent connections (Section 5).
The performance of the convolutional network augmented with position is very unstable due to the presence of the argmax operator. In our implementation, argmax is approximated by a function involving a very large constant, which in turn yields unreliable gradients and poor results. In this model the final layer is fully connected, which is also not the best choice, since the class is unlikely to be a linear function of the features and their positions. The other results are reported below:
Model                              TREC    TRECvn
CNN (implemented on our system)    93.0    91.8
CNN-LSTM                           94.2    92.8
LSTM                               95.4    94.2

Table 1. Accuracy (%) of different models on the two datasets
9. Conclusion
As can be seen, models that utilise temporal information give better results than those that do not.
10. Future work: Support Vector Machine as the final layer
In all of the proposed models, the final layer is a fully connected layer with softmax output, which is equivalent to a linear classifier. In this case, the previous layers act as a feature extractor. The good performance reported above indicates that this feature extractor produces features that can be separated well by a simple linear classifier. It is highly likely that these features are also useful for other kinds of linear classifier.
We propose using a Linear Support Vector Machine as the final layer. This can be done by simply replacing the usual cross-entropy loss function with the Linear Support Vector Machine's loss function. Namely, instead of using the softmax cross-entropy loss

$$L = -\log \frac{e^{w_y^\top h}}{\sum_k e^{w_k^\top h}}$$

(where $h$ is the extracted features from the previous layers and $y$ is the correct class), with a Support Vector Machine as the final layer the loss function becomes

$$L = \mathrm{RELU}\left(1 - t \, w^\top h\right)$$

where RELU stands for the Rectified Linear Unit function, $\mathrm{RELU}(z) = \max(0, z)$, and $t$ is the target, either -1 or 1. This loss function represents the loss when the one-vs-rest scheme is used (since TREC is a multiclass dataset). For a Linear Support Vector Machine with a soft margin, the loss function over $N$ training examples is

$$L = \frac{1}{2}\lVert w \rVert^2 + C \sum_{n=1}^{N} \mathrm{RELU}\left(1 - t_n \, w^\top h_n\right)$$
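A sketch of this replacement, reusing `inp` and `feats` from the Section 3 sketch: the softmax head becomes a linear layer trained with the one-vs-rest hinge loss, with an L2 penalty on its weights standing in for the margin term (the penalty strength, playing the role of $C$, is our assumption).

```python
# Sketch: Linear SVM as the final layer (one-vs-rest hinge loss).
import tensorflow as tf
from tensorflow.keras import layers, regularizers

scores = layers.Dense(6, activation=None,                    # linear scores, no softmax
                      kernel_regularizer=regularizers.l2(0.01))(feats)
svm_model = tf.keras.Model(inp, scores)
# Targets must be encoded as -1/+1 per class for the one-vs-rest scheme;
# Keras' Hinge loss computes mean(max(0, 1 - t * score)).
svm_model.compile(optimizer="adadelta", loss=tf.keras.losses.Hinge())
```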
11. References
[1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
[2] Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).
