Plagiarism detection is difficult since there can be changes made to a sentence at several levels, namely, lexical, semantic, and syntactic level, to construct a paraphrased or plagiarized sentence posing as original. This project presents a novel Supervised Machine Learning Classification Paraphrase Detection System developed by conducting experiments using Microsoft Research Paraphrase (MSRP) Corpus and assessed on the same. The proposed paraphrase detection system has achieved comparable performance with existing paraphrase detection systems. The major contributions of this project are the utilization of a unique combination of lexical, semantic, and syntactic features, utilization of Shapley Additive Explanations (SHAP) Feature Importance Plots in XGBoost, and application of a soft voting classifier comprising of the top 3 performing standalone machine learning classifiers on the training dataset of MSRP Corpus. Another major contribution of the project is the finding that applying data augmentation techniques degrades the performance of machine learning classifiers.
Displacement, Velocity, Acceleration, and Second Derivatives
A Novel Approach for Developing Paraphrase Detection System using Machine Learning
1. A NOVEL APPROACH FOR
DEVELOPING PARAPHRASE
DETECTION SYSTEM USING
MACHINE LEARNING
RUDRADITYO SAHA (16BCE1062)
GUIDE: Dr. BHARADWAJA KUMAR. G
2. INTRODUCTION
• The advent of Internet has made easy and instant access to
vast amounts of information possible.
• The downside to this is that this has led to significant increase
in incidents of plagiarism.
• Detecting plagiarism is a difficult task since there can be
numerous major or minor changes made to a sentence at
several levels, namely, lexical, semantic, and syntactic level, to
construct a paraphrased or plagiarized sentence posing as
original.
2
3. LITERATURE REVIEW
• An in-depth Literature Review has been carried out in the domains of
paraphrase detection systems, and performance enhancement
techniques for paraphrase detection (23 research papers have been
referenced).
• 11 paraphrase detection systems have been explored in Literature
Review of which some of the major ones have been briefly described
below:
Kozareva & Montoyo (2006) have used a voting classifier composed of 3
individual classifiers namely, support vector machine, k-nearest neighbors and
logistic regression for performing paraphrase detection using various lexical
and semantic features. Some features utilized are ratio between common
consecutive n-grams between two sentences and total number of words in the
two sentences, Longest Common Subsequence and a WordNet based
feature.
Finch, Hwang, & Sumita (2005) have used machine translation evaluation
techniques such as Bilingual Evaluation Understudy (BLEU) and Word Error
Rate (WER) as features to be used as input to a support vector machine
classifier for paraphrase detection.
3
4. LITERATURE REVIEW (CONTINUED)
Wan, Dras, Dale, & Paris (2006) have used various lexical features such
as n-gram based features, BLEU based features and syntactic features
such as dependency relation based features. A support vector machine
classifier has been used to classify a sentence pair into either a
paraphrastic pair or a non-paraphrastic pair with the help of such
features.
Chitra & Rajkumar (2013) have developed a support vector machine
based paraphrase recognition system with the help of various lexical,
syntactic and semantic features to detect paraphrases.
Some features utilized are Longest Common Subsequence, BLEU based
features, skipgram based features, and Dependency Relation Overlap.
A feature selection technique namely, genetic algorithm is utilized to not
only improve accuracy of suggested paraphrase recognition method but to
also reduce number of features.
The study carried out by (El Desouki & Gomaa, 2019) provides a
comprehensive discussion on the recent research carried out on
paraphrase detection methods categorized broadly under three
categories namely, unsupervised learning, supervised machine learning
and supervised deep learning.
4
5. LITERATURE REVIEW (CONTINUED)
Since the proposed paraphrase detection method is a machine learning-
based paraphrase detection method, comparison will be done with
machine learning based paraphrase detection methods mentioned by El
Desouki & Gomaa (2019).
• Literature Review is also carried out to explore performance
enhancement techniques for paraphrase detection methods which are
as below:
The study carried out by (Uysal & Gunal, 2014) explores the impact of
commonly used preprocessing tasks such as removing stop-words,
lowercasing and stemming in all possible combinations in two
languages, namely Turkish and English, on two different domains,
specifically news and e-mails for text classification by keeping in
consideration aspects such as accuracy and dimension size.
The study concludes that text preprocessing in text classification is as
important as feature selection for improving performance of classifier, some
text preprocessing tasks such as lowercasing improve classification
performance regardless of domain and language, all possible combinations
of preprocessing tasks must be tried for finding out the best combination,
and removing stop words leads to decrease in performance of classifier.
5
6. LITERATURE REVIEW (CONTINUED)
Wei and Zou (2019) have presented Easy Data Augmentation
(EDA) techniques namely, random insertion, random swap,
random deletion, and synonym replacement for improving text
classification performance by augmenting or adding textual
data to the training dataset of a corpus.
Wei and Zou (2019) have shown that EDA enhances performance
for deep learning classifiers such as convolutional neural
networks for performing classification.
EDA is shown to have performed particularly well for small
datasets.
6
7. RESEARCH GAP
• Some combinations of various lexical, semantic, and syntactic
features, to the best of my knowledge, have not been explored
in the same feature-set. For example, semantic features such
as Word2Vec Embeddings based Features, Sentence
Embeddings based Features and Features based on WordNet
Similarity Measures have not been explored in the same
feature-set.
• Data Augmentation Techniques (Wei & Zou, 2019), to the best
of my knowledge, have not been explored to improve
performance of machine learning classifiers for paraphrase
detection.
• Feature Selection Techniques based on Feature Importance
Plots given by SHAP (Lundberg, & Lee, 2017), to the best of
my knowledge, have not been explored to improve paraphrase
detection performance.
7
8. • To develop a novel Supervised Machine Learning
Classification Paraphrase Detection System, with paraphrase
detection performance comparable to existing paraphrase
detection systems.
• To improve performance of proposed Supervised Machine
Learning Classification Paraphrase Detection System by taking
the following steps:
Use of a unique combination of lexical, semantic, and
syntactic features.
Utilization of SHAP Feature Importance Plots in XGBoost as a
feature selection technique so as to remove the features with
zero contribution.
Application of Data Augmentation Techniques (Wei & Zou,
2019) on training dataset.
Use of a soft voting classifier comprising of top 3 performing
standalone classifiers on training dataset.
OBJECTIVE
8
9. • For developing a Supervised Machine Learning Classification Paraphrase
Detection System, 5 experiments are conducted using Microsoft Research
Paraphrase Corpus (MSRP Corpus) (Quirk, Brockett, & Dolan, 2004a,b).
• The MSRP Corpus consists of 5801 sentence pairs out of which 4076
sentence pairs are present in training dataset and 1725 sentence pairs are
present in test dataset.
• Since, paraphrase detection is a binary classification problem; each sentence
pair is assigned a value “1” or “0” as quality with “1” indicating that the
sentences in the sentence pair are paraphrases and “0” indicating that the
sentences in the sentence pair are not paraphrases.
• In Experiments 1, 2, and 3, the performance of various standalone classifiers
and a soft voting classifier are assessed with the help of various lexical,
semantic, and syntactic features, respectively.
• In Experiment 4, a feature-set composed of various lexical, semantic, and
syntactic features are utilized for evaluating performance of various standalone
classifiers and a soft voting classifier.
• In Experiment 5, Data Augmentation Techniques (Wei & Zou, 2019) are
applied on the training dataset of MSRP Corpus to improve the performance of
various standalone classifiers and a soft voting classifier using the feature-set
utilized in Experiment 4.
RESEARCH METHODOLOGY
9
10. • From Experiments 1, 2, and 3, the features with zero contribution
found with the help of SHAP Feature Importance Plot in XGBoost
are excluded from the feature-set utilized in Experiment 4.
• A SHAP Feature Importance Plot gives the importance values of
various features in descending order present in the feature-set
using the mean (|Tree SHAP|) values given by the Tree SHAP
algorithm (Lundberg, & Lee, 2017).
• The SHAP Feature Importance Plot in XGBoost generated in
Experiments 1, 2, and 3 are given in subsequent slides.
RESEARCH METHODOLOGY (CONTINUED)
10
11. SHAP FEATURE IMPORTANCE PLOT
IN XGBOOST IN EXPERIMENT 1
11
Total: 22 Features, Dropped: 1 Feature
12. SHAP FEATURE IMPORTANCE PLOT
IN XGBOOST IN EXPERIMENT 2
12
Total: 14 Features, Dropped: 3 Features
13. SHAP FEATURE IMPORTANCE PLOT
IN XGBOOST IN EXPERIMENT 3
13
Total: 3 Features, Dropped: 0 Feature
14. FEATURES PRESENT IN FEATURE-SET IN
EXPERIMENT 4
14
Sl.
No
Features Category
Sl.
No
Features Category
Sl.
No
Features Category
1 Fuzz Ratio
Lexical
Features
13
Jaccard Similarity with 1-
skip-2-gram
Lexical
Features
25 Canberra Distance
Semantic
Features
2 Fuzz Partial Ratio 14
Common 2-skip-2-gram-
Ratio
26 City Block Distance
3
Fuzz Partial Token Sort
Ratio
15
Jaccard Similarity with 2-
skip-2-gram Ratio
27 Braycurtis Distance
4 Fuzz Token Set Ratio 16 BLEU-1 28 Wu and Palmer's Measure
5 Fuzz Token Sort Ratio 17 BLEU-2 29
Shortest Path based
Measure
6 Common Unigram Ratio 18 BLEU-3 30
Leakcock and Chodorow's
Measure
7
Jaccard Similarity with
Unigram
19 BLEU-4 31
Jiang and Conrath's
Measure
8 Common Bigram Ratio 20
Absolute Length
Difference
32 Lin's Measure
9
Jaccard Similarity with
Bigram
21
Normalized Longest
Common Subsequence
33
Dependency Relation
Overlap Precision
Syntactic
Features
10 Common Trigram Ratio 22
Normalized Word Mover's
Distance
Semantic
Features
34
Dependency Relation
Overlap Recall
11
Jaccard Similarity with
Trigram
23 Minkowski Distance 35
Dependency Relation
Overlap F-measure
12
Common 1-skip-2-gram
Ratio
24 Cosine Distance
15. PERFORMANCE OF VARIOUS STANDALONE
CLASSIFIERS ON TRAINING DATASET OF MSRP
CORPUS IN EXPERIMENT 4
Sl. No. Classifier Cross Validation Accuracy
1 Logistic Regression 76.13%
2 Random Forest 75.96%
3 Support Vector Machine 75.64%
4 Gradient Boosting 75.54%
5 XGBoost 75.22%
6 K Nearest Neighbors 73.68%
7 AdaBoost 73.60%
8 Baseline 67.54%
9 LightGBM 59.95%
15
16. Sl. No. Classifier Accuracy
Per-Class F1 Score for Positive
Class
1 Logistic Regression 76.87% 83.40%
2 Support Vector Machine 76.46% 83.31%
3 LightGBM 76.00% 82.12%
4 XGBoost 75.94% 82.90%
5 Gradient Boosting 75.77% 82.83%
6 Random Forest 75.65% 82.69%
7 AdaBoost 74.55% 81.99%
8 K Nearest Neighbors 73.51% 80.87%
9 Baseline 66.49% 79.87%
PERFORMANCE OF VARIOUS STANDALONE
CLASSIFIERS ON TEST DATASET OF MSRP
CORPUS IN EXPERIMENT 4
16
17. Sl.
No.
Classifier Accuracy
Precision Recall
Per-Class F1-
Score
0 1 0 1 0 1
1 Logistic Regression 76.87% 69.08% 79.78% 56.06% 87.36% 61.89% 83.40%
2 Random Forest 75.65% 67.71% 78.42% 52.25% 87.45% 58.98% 82.69%
3
Support Vector
Machine
76.46% 69.55% 78.83% 52.94% 88.32% 60.12% 83.31%
4
Voting (Logistic
Regression, Random
Forest, Support
Vector Machine)
77.10% 70.47% 79.42% 54.50% 88.49% 61.46% 83.71%
COMPARISON OF VOTING CLASSIFIER WITH
TOP 3 STANDALONE CLASSIFIERS ON TEST
DATASET OF MSRP CORPUS IN EXPERIMENT 4
17
18. SUMMARY OF RESULTS OF CLASSIFIERS IN
VARIOUS EXPERIMENTS ON TEST DATASET OF
MSRP CORPUS
Experiment Classifier Accuracy
Per-Class F1 Score
for Positive Class
1
Logistic Regression 76.06% 82.88%
Voting (Support Vector Machine, Logistic
Regression, Random Forest)
76.23% 83.29%
2
Support Vector Machine 72.52% 81.16%
Voting (Support Vector Machine, Random
Forest, K Nearest Neighbors)
71.42% 80.30%
3
Logistic Regression 68.93% 79.59%
Voting (Gradient Boosting, Logistic
Regression, XGBoost)
69.62% 80.17%
4
Logistic Regression 76.87% 83.40%
Voting (Logistic Regression, Random
Forest, Support Vector Machine)
77.10% 83.71%
5
Support Vector Machine 75.94% 83.86%
Voting (Random Forest, Support Vector
Machine, K Nearest Neighbors)
74.72% 83.01%
18
19. DISCUSSION
• The performance of voting classifiers of developed methods in Experiments 1, 2, and 3
have been improved with the voting classifier of developed method in Experiment 4.
• The voting classifier of developed method in Experiment 4 has obtained the best
Accuracy Score of 77.10% and best Per-Class F1 Score for Positive Class of 83.71%.
• The voting classifier of developed method in Experiment 5 has obtained lower Accuracy
Score and lower Per-Class F1 Score for Positive Class than the voting classifier of
developed method in Experiment 4.
• From this result, it can be inferred that the applied Data Augmentation Techniques (Wei &
Zou, 2019) on the developed method in Experiment 5 not only do not improve but also
degrade the performance of machine learning classifiers.
• This is a new finding since Wei & Zou (2019) have demonstrated improvement in
performance of text classification by applying such data augmentation techniques on
only deep learning classifiers namely, Convolutional Neural Networks and Recurrent
Neural Networks, and not on machine learning classifiers.
• Hence, the developed method in Experiment 4 is selected as the proposed Supervised
Machine Learning Classification Paraphrase Detection System.
• The code created for the proposed Supervised Machine Learning Classification
Paraphrase Detection System, run on Google Colaboratory using Python 3 Programming
Language, is uploaded at https://github.com/Rudradityo/Paraphrase-Detection.
19
20. PROPOSED SUPERVISED MACHINE LEARNING
CLASSIFICATION PARAPHRASE DETECTION
SYSTEM
There are 4 Modules in the proposed Supervised Machine Learning Classification Paraphrase
Detection System, which are as follows:
Text Preprocessing Module
Various text preprocessing techniques have been explored specifically, removing
trailing and ending whitespaces, lowercasing, removing punctuation marks, converting
accented characters to ASCII characters, removing special characters or substituting
special characters with their literal counterpart such as ‘$’ to “dollar”, removing
numbers, removing stop words from a custom stop words list, stemming or
lemmatizing, and expanding contractions.
Techniques namely, removing trailing and ending whitespaces, lowercasing, removing
punctuation marks, and converting accented characters to ASCII characters have
been shown to give the best performance and hence, utilized in the proposed
Supervised Machine Learning Classification Paraphrase Detection System.
Feature Engineering Module
The feature-set utilized in the proposed Supervised Machine Learning Classification
Paraphrase Detection System is the same as the feature-set utilized in Experiment 4.
20
21. PROPOSED SUPERVISED MACHINE LEARNING
CLASSIFICATION PARAPHRASE DETECTION
SYSTEM (CONTINUED)
Classification Module
Various machine learning classifiers are applied to compare their performance in
paraphrase detection, as per directions for future work by Ji & Eisenstein (2013).
The machine learning classifiers used are XGBoost, Logistic Regression, Gradient
Boosting, AdaBoost, Random Forest, K Nearest Neighbors, LightGBM, and Support
Vector Machine.
A baseline majority class classifier is made use of to enable comparison with the
previously mentioned machine learning classifiers.
A soft voting classifier, constructed out of top 3 performing standalone machine
learning classifiers on the training dataset of MSRP Corpus, is also utilized.
Each constituent classifier in the soft voting classifier provides a probability value for
both class labels namely, non-paraphrastic pair (0) and paraphrastic pair (1) for a
sentence pair. For each class label, the predictions given by each constituent classifier
are averaged. The class label which is nearer to the max of two averages is then taken
as the predicted class label for the sentence pair.
21
22. PROPOSED SUPERVISED MACHINE LEARNING
CLASSIFICATION PARAPHRASE DETECTION
SYSTEM (CONTINUED)
Evaluation Module
The Cross Validation Accuracy Scores of the various standalone classifiers are
obtained for the training dataset of MSRP Corpus.
The best 3 performing standalone classifiers on the training dataset of MSRP Corpus
are utilized to form a soft voting classifier.
The Accuracy Scores and Per-Class F1-Scores for Positive Class of various
standalone classifiers are also obtained for the test dataset of MSRP Corpus.
The Accuracy Score and Per-Class F1-Score for Positive Class of the soft voting
classifier are together taken as the final performance score for the proposed
Supervised Machine Learning Classification Paraphrase Detection System.
22
23. SYSTEM ARCHITECTURE
TEXT PREPROCESSING FEATURE ENGINEERING
CLASSIFICATION EVALUATION
23
The modules described previously are illustrated above in sequence.
24. EVALUATION
• The performance of the proposed Supervised Machine Learning
Classification Paraphrase Detection System on MSRP Corpus
has been evaluated using classification evaluation metrics
namely, Accuracy and Per-class F1 Score for Positive Class.
• The formula for Accuracy is given by:
𝑨𝒄𝒄𝒖𝒓𝒂𝒄𝒚 =
𝑻𝑷 + 𝑻𝑵
𝑻𝑷 + 𝑭𝑷 + 𝑻𝑵 + 𝑭𝑵
• The formula for Per-class F1 Score for Positive Class is given by:
𝐏𝐞𝐫 − 𝐜𝐥𝐚𝐬𝐬 𝐅𝟏 𝐒𝐜𝐨𝐫𝐞 𝐟𝐨𝐫 𝐏𝐨𝐬𝐢𝐭𝐢𝐯𝐞 𝐂𝐥𝐚𝐬𝐬 =
𝟐𝑻𝑷
𝟐𝑻𝑷 + 𝑭𝑷 + 𝑭𝑵
where TP (True Positive), TN (True Negative), FP (False Positive),
and FN (False Negative) are terms present in confusion matrix.
24
25. EVALUATION (CONTINUED)
Sl.
No.
Paraphrase Detection System Accuracy
Per Class F1 Score
for Positive Class
1 Qiu, Kan, & Chua (2006) 72% 81.6%
2 Kozareva, & Montoyo (2006) 76.6% 79.6%
3 Ul-Qayyum, & Altaf (2012) 74.7% 81.8%
4 Finch, Hwang, & Sumita (2005) 75.0% 82.7%
5 Wan, Dras, Dale, & Paris (2006) 75.6% 83%
6 Madnani, Tetreault, & Chodorow (2012) 77.4% 84.1%
7 Das & Smith (2009) 76.1% 82.7%
8 Ji & Eisenstein (2013) 80.4% 86.0%
9 Filice, Martino & Moschitti (2015) 79.1% 85.2%
10
Proposed Supervised Machine Learning
Classification Paraphrase Detection System
77.1% 83.7%
The comparison of performance of Proposed Supervised Machine Learning
Classification Paraphrase Detection System with existing paraphrase detection
systems on Test Dataset of MSRP Corpus is illustrated in the table below:
25
26. EVALUATION (CONTINUED)
• Paraphrase detection methods utilizing deep learning classifiers
are not included for comparison since the proposed paraphrase
detection method does not utilize any deep learning classifier.
• The proposed Supervised Machine Learning Classification
Paraphrase Detection System has achieved Accuracy Score of
77.10% and Per-Class F1 Score for Positive Class of 83.71% on
test dataset of MSRP Corpus.
• The proposed Supervised Machine Learning Classification
Paraphrase Detection System has achieved better performance
than 6 paraphrase detection systems namely, Qiu, Kan, & Chua
(2006), Kozareva & Montoyo (2006), Ul-Qayyum & Altaf (2012),
Finch, Hwang, & Sumita (2005), Wan, Dras, Dale, & Paris (2006),
and Das & Smith (2009) out of 9 paraphrase detection systems.
The remaining 3 paraphrase detection systems namely, Madnani,
Tetreault, & Chodorow (2012), Ji & Eisenstein (2013), and Filice,
Martino & Moschitti (2015) have declared better performance as
they have used some advanced features which could not be
replicated or improved upon.
26
27. CONCLUSION AND FUTURE WORK
• The proposed Supervised Machine Learning Classification Paraphrase Detection System
has achieved comparable performance with existing paraphrase detection systems.
• The major contributions of this project are as follows:
Utilization of a unique combination of lexical, semantic, and syntactic features in the
proposed Supervised Machine Learning Classification Paraphrase Detection System
which, to the best of my knowledge, has not been explored previously in the same
feature-set.
Utilization of SHAP Feature Importance Plots in XGBoost as a feature selection
technique for the various experiments is undertaken to remove the features with zero
contribution in order to improve the performance of proposed Supervised Machine
Learning Classification Paraphrase Detection System. In all the research papers
referenced, there is, to the best of my knowledge, no usage of SHAP Importance Plot
for either illustrating the contribution of various features or as a feature selection
technique.
Application of Data Augmentation Techniques (Wei & Zou, 2019) has reduced
performance of machine learning classifiers, contrary to the findings of the authors
who have demonstrated improvement of performance on text classification by
applying such data augmentation techniques on only deep learning classifiers namely,
Convolutional Neural Networks and Recurrent Neural Networks, and not machine
learning classifiers.
27
28. CONCLUSION AND FUTURE WORK (CONTINUED)
The soft voting classifier composed of standalone classifiers namely, Logistic
Regression, Random Forest, and Support Vector Machine has performed slightly
better than each of the various previously mentioned standalone classifier in the
proposed Supervised Machine Learning Classification Paraphrase Detection
System. This finding is in agreement with the results of Kozareva & Montoyo
(2006).
• For future work, more advanced features of lexical, semantic, and syntactic category in
addition to the ones utilized can be added in the feature-set of the proposed Supervised
Machine Learning Classification Paraphrase Detection System for better performance in
paraphrase detection.
28
29. REFERENCES
1. Cer, D., Yang, Y., Kong, S. Y., Hua, N., Limtiaco, N., John, R. S., Constant, N.,
Céspedes, M. G., Yuan, S., Tar C., Sung, Y. H., Strope B. & Kurzweil R. (2018).
Universal sentence encoder. arXiv preprint arXiv:1803.11175.
2. Chen, B., & Cherry, C. (2014, June). A systematic comparison of smoothing
techniques for sentence-level bleu. In Proceedings of the Ninth Workshop on
Statistical Machine Translation (pp. 362-367).
3. Chitra, A., & Rajkumar, A. (2013). Genetic algorithm based feature selection for
paraphrase recognition. International Journal on Artificial Intelligence Tools, 22(02),
1350007.
4. Chitra, A., & Rajkumar, A. (2016). Plagiarism detection using machine learning-
based paraphrase recognizer. Journal of Intelligent Systems, 25(3), 351-359.
5. Das, D., & Smith, N. A. (2009, August). Paraphrase identification as probabilistic
quasi-synchronous recognition. In Proceedings of the Joint Conference of the 47th
Annual Meeting of the ACL and the 4th International Joint Conference on Natural
Language Processing of the AFNLP: Volume 1-Volume 1 (pp. 468-476). Association
for Computational Linguistics.
6. Dolan, Bill & Quirk, Chris & Brockett, Chris. (2004a). Unsupervised construction of
large paraphrase corpora: Exploiting massively parallel news sources. Proceedings
of the 20th International Conference on Computational Linguistics.
10.3115/1220355.1220406.
29
30. REFERENCES (CONTINUED)
7. El Desouki, M. I., & Gomaa, W. H. (2019). Exploring the Recent Trends of
Paraphrase Detection. International Journal of Computer Applications, 975, 8887
8. Filice, S., Da San Martino, G., & Moschitti, A. (2015, July). Structural
representations for learning relations between pairs of texts. In Proceedings of the
53rd Annual Meeting of the Association for Computational Linguistics and the 7th
International Joint Conference on Natural Language Processing (Volume 1: Long
Papers) (pp. 1003-1013).
9. Finch, A., Hwang, Y. S., & Sumita, E. (2005). Using machine translation evaluation
techniques to determine sentence-level semantic equivalence. In Proceedings of
the third international workshop on paraphrasing (IWP2005).
10. Ji, Y., & Eisenstein, J. (2013, October). Discriminative improvements to distributional
sentence similarity. In Proceedings of the 2013 Conference on Empirical Methods in
Natural Language Processing (pp. 891-896).
11. Kozareva, Z., & Montoyo, A. (2006, August). Paraphrase identification on the basis
of supervised machine learning techniques. In International Conference on Natural
Language Processing (in Finland) (pp. 524-533). Springer, Berlin, Heidelberg.
12. Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015, June). From word
embeddings to document distances. In International conference on machine
learning (pp. 957-966).
30
31. REFERENCES (CONTINUED)
13. Lesk, M. (1986, June). Automatic sense disambiguation using machine readable
dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the
5th annual international conference on Systems documentation (pp. 24-26).
14. Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model
predictions. In Advances in neural information processing systems (pp. 4765-4774).
15. Madnani, N., Tetreault, J., & Chodorow, M. (2012, June). Re-examining machine
translation metrics for paraphrase identification. In Proceedings of the 2012
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (pp. 182-190). Association for
Computational Linguistics.
16. Meng, L., Huang, R., & Gu, J. (2013). A review of semantic similarity measures in
wordnet. International Journal of Hybrid Information Technology, 6(1), 1-12.
17. Qiu, L., Kan, M. Y., & Chua, T. S. (2006, July). Paraphrase recognition via
dissimilarity significance classification. In Proceedings of the 2006 Conference on
Empirical Methods in Natural Language Processing (pp. 18-26). Association for
Computational Linguistics.
18. Quirk, Chris & Brockett, Chris & Dolan, William. (2004b). Monolingual Machine
Translation for Paraphrase Generation. Conference: Proceedings of the 2004
Conference on Empirical Methods in Natural Language Processing, EMNLP 2004
142-149.
31
32. REFERENCES (CONTINUED)
19. Ul-Qayyum, Z., & Altaf, W. (2012). Paraphrase identification using semantic
heuristic features. Research Journal of Applied Sciences, Engineering and
Technology, 4(22), 4894-4904.
20. Uribe, D. (2009, November). Effectively using monotonicity analysis for paraphrase
identification. In 2009 Eighth Mexican International Conference on Artificial
Intelligence (pp. 108-113). IEEE.
21. Uysal, A. K., & Gunal, S. (2014). The impact of preprocessing on text classification.
Information Processing & Management, 50(1), 104-112.
22. Wan, Stephen & Dras, Mark & Dale, Robert & Paris, Cécile. (2006). Using
dependency-based features to take the “para-farce” out of paraphrase. Proceedings
of the Australasian Language Technology Workshop. 131-138.
23. Wei, Jason & Zou, Kai. (2019). EDA: Easy Data Augmentation Techniques for
Boosting Performance on Text Classification Tasks.
32
33. ACKNOWLEDGEMENT
I am highly indebted to my Guide Dr. Bharadwaja Kumar. G for his
guidance, feedback, and support for the project.
I would also like to express my special thanks of gratitude to Head
of the Department Dr. Justus and Capstone Project Coordinator
Dr. B V A N S S Prabhakar Rao for their support and
encouragement to complete the project.
I am also grateful to my parents for their support and
encouragement during the completion of the project.
33
THANK YOU