A NOVEL APPROACH FOR
DEVELOPING PARAPHRASE
DETECTION SYSTEM USING
MACHINE LEARNING
RUDRADITYO SAHA (16BCE1062)
GUIDE: Dr. BHARADWAJA KUMAR. G
INTRODUCTION
• The advent of the Internet has made easy, instant access to
vast amounts of information possible.
• The downside is that this has led to a significant increase in
incidents of plagiarism.
• Detecting plagiarism is a difficult task, since numerous major or
minor changes can be made to a sentence at several levels,
namely the lexical, semantic, and syntactic levels, to construct a
paraphrased or plagiarized sentence posing as original.
LITERATURE REVIEW
• An in-depth Literature Review has been carried out in the domains of
paraphrase detection systems, and performance enhancement
techniques for paraphrase detection (23 research papers have been
referenced).
• 11 paraphrase detection systems have been explored in the
Literature Review, of which some of the major ones are briefly
described below:
 Kozareva & Montoyo (2006) have used a voting classifier composed of 3
individual classifiers, namely support vector machine, k-nearest neighbors, and
logistic regression, for performing paraphrase detection using various lexical
and semantic features. Some features utilized are the ratio of common
consecutive n-grams to the total number of words in the two sentences,
Longest Common Subsequence, and a WordNet based feature (a minimal
sketch of this style of ensemble follows this list).
 Finch, Hwang, & Sumita (2005) have used machine translation evaluation
techniques such as Bilingual Evaluation Understudy (BLEU) and Word Error
Rate (WER) as features to be used as input to a support vector machine
classifier for paraphrase detection.
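To make the feature-to-classifier pipeline concrete, below is a minimal sketch (not the authors' code) of the Kozareva & Montoyo (2006) style of ensemble: a hand-crafted lexical feature, here the common n-gram ratio, feeding a voting ensemble of support vector machine, k-nearest neighbors, and logistic regression. The toy sentence pairs and labels are illustrative only.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def common_ngram_ratio(s1: str, s2: str, n: int) -> float:
    """Shared word n-grams divided by the total number of words in both sentences."""
    grams = lambda s: {tuple(s.split()[i:i + n]) for i in range(len(s.split()) - n + 1)}
    total_words = len(s1.split()) + len(s2.split())
    return len(grams(s1) & grams(s2)) / total_words if total_words else 0.0

# Toy sentence pairs and labels (1 = paraphrase, 0 = not), for illustration only.
pairs = [("the cat sat down", "the cat sat"),
         ("a dog barked loudly", "stocks fell sharply today")]
X = [[common_ngram_ratio(s1, s2, n) for n in (1, 2, 3)] for s1, s2 in pairs]
y = [1, 0]

ensemble = VotingClassifier(
    estimators=[("svm", SVC()),
                ("knn", KNeighborsClassifier(n_neighbors=1)),
                ("lr", LogisticRegression())],
    voting="hard",  # majority vote across the three classifiers
)
ensemble.fit(X, y)
print(ensemble.predict(X))
```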
LITERATURE REVIEW (CONTINUED)
 Wan, Dras, Dale, & Paris (2006) have used various lexical features such
as n-gram based features, BLEU based features and syntactic features
such as dependency relation based features. A support vector machine
classifier has been used to classify a sentence pair into either a
paraphrastic pair or a non-paraphrastic pair with the help of such
features.
 Chitra & Rajkumar (2013) have developed a support vector machine
based paraphrase recognition system with the help of various lexical,
syntactic and semantic features to detect paraphrases.
 Some features utilized are Longest Common Subsequence, BLEU based
features, skipgram based features, and Dependency Relation Overlap (a
BLEU feature sketch follows this list).
 A feature selection technique, namely a genetic algorithm, is utilized not
only to improve the accuracy of the suggested paraphrase recognition
method but also to reduce the number of features.
 The study by El Desouki & Gomaa (2019) provides a comprehensive
discussion of recent research on paraphrase detection methods,
categorized broadly under three categories, namely unsupervised
learning, supervised machine learning, and supervised deep learning.
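As a concrete illustration of the BLEU-based features used by Finch, Hwang, & Sumita (2005) and Chitra & Rajkumar (2013), here is a hedged sketch computing sentence-level BLEU-1 through BLEU-4 for one sentence pair with NLTK; the smoothing method (Chen & Cherry, 2014) is an assumption, not necessarily the variant used in those papers.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_features(s1: str, s2: str):
    """BLEU-1 through BLEU-4 for one sentence pair, treating s1 as the reference."""
    ref, hyp = [s1.split()], s2.split()
    smooth = SmoothingFunction().method1  # one of several smoothing variants
    weights = [(1, 0, 0, 0),              # BLEU-1: all weight on unigrams
               (0.5, 0.5, 0, 0),          # BLEU-2
               (1/3, 1/3, 1/3, 0),        # BLEU-3
               (0.25, 0.25, 0.25, 0.25)]  # BLEU-4
    return [sentence_bleu(ref, hyp, weights=w, smoothing_function=smooth)
            for w in weights]

print(bleu_features("the cat sat on the mat", "the cat is on the mat"))
```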
LITERATURE REVIEW (CONTINUED)
 Since the proposed paraphrase detection method is a machine learning-
based paraphrase detection method, comparison will be done with
machine learning based paraphrase detection methods mentioned by El
Desouki & Gomaa (2019).
• The Literature Review also explores performance enhancement
techniques for paraphrase detection methods, which are described
below:
 The study by Uysal & Gunal (2014) explores the impact of commonly
used preprocessing tasks, such as removing stop-words, lowercasing,
and stemming, in all possible combinations, in two languages, namely
Turkish and English, on two different domains, specifically news and
e-mails, for text classification, while considering aspects such as
accuracy and dimension size.
 The study concludes that text preprocessing in text classification is as
important as feature selection for improving classifier performance; that some
preprocessing tasks, such as lowercasing, improve classification performance
regardless of domain and language; that all possible combinations of
preprocessing tasks must be tried to find the best combination (a sketch of
this exhaustive search follows below); and that removing stop words
decreases classifier performance.
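A minimal sketch of the exhaustive-search protocol of Uysal & Gunal (2014): every on/off combination of preprocessing steps is tried and the best-scoring one kept. The `evaluate` function here is a hypothetical stand-in for training and scoring a real classifier.

```python
from itertools import product

def preprocess(text, lowercase, strip_punct, remove_stopwords):
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    if remove_stopwords:
        text = " ".join(w for w in text.split() if w not in {"the", "a", "an"})
    return text

def evaluate(prep):
    # Hypothetical stand-in: in the study, this would train and score a
    # text classifier on the preprocessed corpus.
    return len(prep("The Quick, Brown Fox!"))

# Enumerate all 2^3 combinations of the three preprocessing flags.
best_score, best_flags = max(
    (evaluate(lambda t, f=flags: preprocess(t, *f)), flags)
    for flags in product([False, True], repeat=3)
)
print(best_score, best_flags)
```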
LITERATURE REVIEW (CONTINUED)
 Wei and Zou (2019) have presented Easy Data Augmentation
(EDA) techniques namely, random insertion, random swap,
random deletion, and synonym replacement for improving text
classification performance by augmenting or adding textual
data to the training dataset of a corpus.
 Wei and Zou (2019) have shown that EDA enhances performance
for deep learning classifiers such as convolutional neural
networks for performing classification.
 EDA is shown to perform particularly well for small datasets (a
sketch of two EDA operations follows below).
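For illustration, here is a simplified sketch of two of the four EDA operations, random swap and random deletion, following the logic described by Wei & Zou (2019); synonym replacement and random insertion would additionally require a thesaurus such as WordNet.

```python
import random

def random_swap(words, n=1):
    """Swap the positions of two randomly chosen words, n times."""
    words = words[:]
    for _ in range(n):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """Delete each word independently with probability p."""
    if len(words) == 1:
        return words
    kept = [w for w in words if random.random() > p]
    return kept or [random.choice(words)]  # never return an empty sentence

sentence = "the quick brown fox jumps".split()
augmented = [" ".join(random_swap(sentence)), " ".join(random_deletion(sentence))]
print(augmented)
```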
RESEARCH GAP
• Some combinations of various lexical, semantic, and syntactic
features, to the best of my knowledge, have not been explored
in the same feature-set. For example, semantic features such
as Word2Vec Embeddings based Features, Sentence
Embeddings based Features and Features based on WordNet
Similarity Measures have not been explored in the same
feature-set.
• Data Augmentation Techniques (Wei & Zou, 2019), to the best
of my knowledge, have not been explored to improve
performance of machine learning classifiers for paraphrase
detection.
• Feature Selection Techniques based on Feature Importance
Plots given by SHAP (Lundberg, & Lee, 2017), to the best of
my knowledge, have not been explored to improve paraphrase
detection performance.
OBJECTIVE
• To develop a novel Supervised Machine Learning
Classification Paraphrase Detection System, with paraphrase
detection performance comparable to existing paraphrase
detection systems.
• To improve the performance of the proposed Supervised
Machine Learning Classification Paraphrase Detection System
through the following steps:
 Use of a unique combination of lexical, semantic, and
syntactic features.
 Utilization of SHAP Feature Importance Plots in XGBoost as a
feature selection technique so as to remove the features with
zero contribution.
 Application of Data Augmentation Techniques (Wei & Zou,
2019) on training dataset.
 Use of a soft voting classifier comprising the top 3 performing
standalone classifiers on the training dataset.
RESEARCH METHODOLOGY
• For developing a Supervised Machine Learning Classification Paraphrase
Detection System, 5 experiments are conducted using the Microsoft Research
Paraphrase Corpus (MSRP Corpus) (Dolan, Quirk, & Brockett, 2004a; Quirk,
Brockett, & Dolan, 2004b).
• The MSRP Corpus consists of 5801 sentence pairs, of which 4076 sentence
pairs are present in the training dataset and 1725 sentence pairs are present
in the test dataset (a corpus-loading sketch follows this list).
• Since paraphrase detection is a binary classification problem, each sentence
pair is assigned a quality value of "1" or "0", with "1" indicating that the
sentences in the sentence pair are paraphrases and "0" indicating that they
are not.
• In Experiments 1, 2, and 3, the performance of various standalone classifiers
and a soft voting classifier is assessed with the help of various lexical,
semantic, and syntactic features, respectively.
• In Experiment 4, a feature-set composed of various lexical, semantic, and
syntactic features is utilized for evaluating the performance of various
standalone classifiers and a soft voting classifier.
• In Experiment 5, Data Augmentation Techniques (Wei & Zou, 2019) are
applied on the training dataset of MSRP Corpus to improve the performance of
various standalone classifiers and a soft voting classifier using the feature-set
utilized in Experiment 4.
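The corpus-loading sketch referenced above: a hedged example of reading the MSRP Corpus with pandas. The file names and the tab-separated layout (Quality, #1 ID, #2 ID, #1 String, #2 String) follow the standard distribution of the corpus, but should be verified against the local copy.

```python
import pandas as pd

# quoting=3 (csv.QUOTE_NONE) avoids misreading quotation marks inside sentences.
train = pd.read_csv("msr_paraphrase_train.txt", sep="\t", quoting=3)  # 4076 pairs
test = pd.read_csv("msr_paraphrase_test.txt", sep="\t", quoting=3)    # 1725 pairs

# "Quality" is the binary label: 1 = paraphrase pair, 0 = non-paraphrase pair.
print(train["Quality"].value_counts())
```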
RESEARCH METHODOLOGY (CONTINUED)
• The features found to have zero contribution in Experiments 1, 2,
and 3, as identified by the SHAP Feature Importance Plots in
XGBoost, are excluded from the feature-set utilized in Experiment 4.
• A SHAP Feature Importance Plot ranks the features in the
feature-set in descending order of importance, using the mean
absolute SHAP values given by the Tree SHAP algorithm
(Lundberg & Lee, 2017); a feature-selection sketch follows below.
• The SHAP Feature Importance Plots in XGBoost generated in
Experiments 1, 2, and 3 are given in the subsequent slides.
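The feature-selection step can be sketched as follows (assumed shap and xgboost API usage on synthetic data, not the project's exact code): features whose mean absolute SHAP value is zero are dropped.

```python
import numpy as np
import shap
import xgboost as xgb

# Synthetic data: only the first two of five features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=50).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)  # Tree SHAP

importance = np.abs(shap_values).mean(axis=0)  # mean |SHAP value| per feature
keep = importance > 0                          # drop zero-contribution features
X_reduced = X[:, keep]
print(importance, keep)
```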
SHAP FEATURE IMPORTANCE PLOT
IN XGBOOST IN EXPERIMENT 1
Total: 22 Features, Dropped: 1 Feature
SHAP FEATURE IMPORTANCE PLOT
IN XGBOOST IN EXPERIMENT 2
Total: 14 Features, Dropped: 3 Features
SHAP FEATURE IMPORTANCE PLOT
IN XGBOOST IN EXPERIMENT 3
Total: 3 Features, Dropped: 0 Features
FEATURES PRESENT IN FEATURE-SET IN
EXPERIMENT 4

Lexical Features (Sl. No. 1-21): Fuzz Ratio; Fuzz Partial Ratio; Fuzz Partial Token Sort Ratio; Fuzz Token Set Ratio; Fuzz Token Sort Ratio; Common Unigram Ratio; Jaccard Similarity with Unigram; Common Bigram Ratio; Jaccard Similarity with Bigram; Common Trigram Ratio; Jaccard Similarity with Trigram; Common 1-skip-2-gram Ratio; Jaccard Similarity with 1-skip-2-gram; Common 2-skip-2-gram Ratio; Jaccard Similarity with 2-skip-2-gram; BLEU-1; BLEU-2; BLEU-3; BLEU-4; Absolute Length Difference; Normalized Longest Common Subsequence

Semantic Features (Sl. No. 22-32): Normalized Word Mover's Distance; Minkowski Distance; Cosine Distance; Canberra Distance; City Block Distance; Bray-Curtis Distance; Wu and Palmer's Measure; Shortest Path based Measure; Leacock and Chodorow's Measure; Jiang and Conrath's Measure; Lin's Measure

Syntactic Features (Sl. No. 33-35): Dependency Relation Overlap Precision; Dependency Relation Overlap Recall; Dependency Relation Overlap F-measure

Three of these features are computed in the sketch below the table.
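Three of the tabled features computed for a single sentence pair, as a minimal sketch: Fuzz Ratio (via the fuzzywuzzy package, an assumption about the library used), Jaccard Similarity with Unigram, and Absolute Length Difference.

```python
from fuzzywuzzy import fuzz

def jaccard_unigram(s1: str, s2: str) -> float:
    """Jaccard similarity between the unigram (word) sets of two sentences."""
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b) if a | b else 0.0

s1, s2 = "the cat sat on the mat", "a cat is sitting on the mat"
features = {
    "fuzz_ratio": fuzz.ratio(s1, s2) / 100.0,  # character-level similarity, scaled to [0, 1]
    "jaccard_unigram": jaccard_unigram(s1, s2),
    "abs_length_difference": abs(len(s1.split()) - len(s2.split())),
}
print(features)
```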
PERFORMANCE OF VARIOUS STANDALONE
CLASSIFIERS ON TRAINING DATASET OF MSRP
CORPUS IN EXPERIMENT 4
Sl. No. | Classifier | Cross Validation Accuracy
1 | Logistic Regression | 76.13%
2 | Random Forest | 75.96%
3 | Support Vector Machine | 75.64%
4 | Gradient Boosting | 75.54%
5 | XGBoost | 75.22%
6 | K Nearest Neighbors | 73.68%
7 | AdaBoost | 73.60%
8 | Baseline | 67.54%
9 | LightGBM | 59.95%
PERFORMANCE OF VARIOUS STANDALONE
CLASSIFIERS ON TEST DATASET OF MSRP
CORPUS IN EXPERIMENT 4

Sl. No. | Classifier | Accuracy | Per-Class F1 Score for Positive Class
1 | Logistic Regression | 76.87% | 83.40%
2 | Support Vector Machine | 76.46% | 83.31%
3 | LightGBM | 76.00% | 82.12%
4 | XGBoost | 75.94% | 82.90%
5 | Gradient Boosting | 75.77% | 82.83%
6 | Random Forest | 75.65% | 82.69%
7 | AdaBoost | 74.55% | 81.99%
8 | K Nearest Neighbors | 73.51% | 80.87%
9 | Baseline | 66.49% | 79.87%

COMPARISON OF VOTING CLASSIFIER WITH
TOP 3 STANDALONE CLASSIFIERS ON TEST
DATASET OF MSRP CORPUS IN EXPERIMENT 4

Sl. No. | Classifier | Accuracy | Precision (0) | Precision (1) | Recall (0) | Recall (1) | F1 (0) | F1 (1)
1 | Logistic Regression | 76.87% | 69.08% | 79.78% | 56.06% | 87.36% | 61.89% | 83.40%
2 | Random Forest | 75.65% | 67.71% | 78.42% | 52.25% | 87.45% | 58.98% | 82.69%
3 | Support Vector Machine | 76.46% | 69.55% | 78.83% | 52.94% | 88.32% | 60.12% | 83.31%
4 | Voting (Logistic Regression, Random Forest, Support Vector Machine) | 77.10% | 70.47% | 79.42% | 54.50% | 88.49% | 61.46% | 83.71%
SUMMARY OF RESULTS OF CLASSIFIERS IN
VARIOUS EXPERIMENTS ON TEST DATASET OF
MSRP CORPUS
Experiment | Classifier | Accuracy | Per-Class F1 Score for Positive Class
1 | Logistic Regression | 76.06% | 82.88%
1 | Voting (Support Vector Machine, Logistic Regression, Random Forest) | 76.23% | 83.29%
2 | Support Vector Machine | 72.52% | 81.16%
2 | Voting (Support Vector Machine, Random Forest, K Nearest Neighbors) | 71.42% | 80.30%
3 | Logistic Regression | 68.93% | 79.59%
3 | Voting (Gradient Boosting, Logistic Regression, XGBoost) | 69.62% | 80.17%
4 | Logistic Regression | 76.87% | 83.40%
4 | Voting (Logistic Regression, Random Forest, Support Vector Machine) | 77.10% | 83.71%
5 | Support Vector Machine | 75.94% | 83.86%
5 | Voting (Random Forest, Support Vector Machine, K Nearest Neighbors) | 74.72% | 83.01%
DISCUSSION
• The voting classifier of the method developed in Experiment 4 improves on the voting
classifiers of the methods developed in Experiments 1, 2, and 3.
• The voting classifier of the method developed in Experiment 4 has obtained the best
Accuracy Score of 77.10% and the best Per-Class F1 Score for Positive Class of 83.71%.
• The voting classifier of the method developed in Experiment 5 has obtained a lower
Accuracy Score and a lower Per-Class F1 Score for Positive Class than the voting
classifier of the method developed in Experiment 4.
• From this result, it can be inferred that the Data Augmentation Techniques (Wei & Zou,
2019) applied in Experiment 5 do not merely fail to improve but actually degrade the
performance of the machine learning classifiers.
• This is a new finding, since Wei & Zou (2019) demonstrated improved text classification
performance by applying such data augmentation techniques only to deep learning
classifiers, namely Convolutional Neural Networks and Recurrent Neural Networks, and
not to machine learning classifiers.
• Hence, the developed method in Experiment 4 is selected as the proposed Supervised
Machine Learning Classification Paraphrase Detection System.
• The code for the proposed Supervised Machine Learning Classification Paraphrase
Detection System, written in Python 3 and run on Google Colaboratory, is available at
https://github.com/Rudradityo/Paraphrase-Detection.
PROPOSED SUPERVISED MACHINE LEARNING
CLASSIFICATION PARAPHRASE DETECTION
SYSTEM
There are 4 Modules in the proposed Supervised Machine Learning Classification Paraphrase
Detection System, which are as follows:
 Text Preprocessing Module
 Various text preprocessing techniques have been explored, specifically: removing
leading and trailing whitespace, lowercasing, removing punctuation marks, converting
accented characters to ASCII characters, removing special characters or substituting
them with their literal counterparts (e.g., '$' with "dollar"), removing numbers,
removing stop words from a custom stop-words list, stemming or lemmatizing, and
expanding contractions.
 The techniques of removing leading and trailing whitespace, lowercasing, removing
punctuation marks, and converting accented characters to ASCII characters have
been shown to give the best performance and are hence utilized in the proposed
Supervised Machine Learning Classification Paraphrase Detection System (a
preprocessing sketch follows at the end of this slide).
 Feature Engineering Module
 The feature-set utilized in the proposed Supervised Machine Learning Classification
Paraphrase Detection System is the same as the feature-set utilized in Experiment 4.
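The preprocessing sketch referenced above, covering the four retained steps; the use of string.punctuation and unicodedata here is an assumption about the implementation, not the project's exact code.

```python
import string
import unicodedata

def preprocess(text: str) -> str:
    text = text.strip().lower()  # remove leading/trailing whitespace, lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    # Convert accented characters to their closest ASCII equivalents.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return text

print(preprocess("  Café prices rose 5%!  "))  # -> "cafe prices rose 5"
```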
PROPOSED SUPERVISED MACHINE LEARNING
CLASSIFICATION PARAPHRASE DETECTION
SYSTEM (CONTINUED)
 Classification Module
 Various machine learning classifiers are applied to compare their performance in
paraphrase detection, as per directions for future work by Ji & Eisenstein (2013).
 The machine learning classifiers used are XGBoost, Logistic Regression, Gradient
Boosting, AdaBoost, Random Forest, K Nearest Neighbors, LightGBM, and Support
Vector Machine.
 A baseline majority-class classifier is used to enable comparison with the
previously mentioned machine learning classifiers.
 A soft voting classifier, constructed from the top 3 performing standalone machine
learning classifiers on the training dataset of the MSRP Corpus, is also utilized.
 Each constituent classifier in the soft voting classifier outputs a probability for each
class label, namely non-paraphrastic pair (0) and paraphrastic pair (1), for a
sentence pair. These probabilities are averaged per class across the constituent
classifiers, and the class label with the higher average probability is taken as the
predicted label for the sentence pair (a minimal sketch follows below).
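A minimal sketch of the soft voting classifier described above, using scikit-learn; the hyperparameters shown are defaults, not the tuned values used in the experiments.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier()),
        ("svm", SVC(probability=True)),  # probability=True is required for soft voting
    ],
    voting="soft",  # average per-class probabilities; predict the higher average
)
# voter.fit(X_train, y_train); voter.predict(X_test)  # X/y: placeholder feature arrays
```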
PROPOSED SUPERVISED MACHINE LEARNING
CLASSIFICATION PARAPHRASE DETECTION
SYSTEM (CONTINUED)
 Evaluation Module
 The Cross Validation Accuracy Scores of the various standalone classifiers are
obtained for the training dataset of MSRP Corpus.
 The best 3 performing standalone classifiers on the training dataset of MSRP Corpus
are utilized to form a soft voting classifier.
 The Accuracy Scores and Per-Class F1-Scores for Positive Class of various
standalone classifiers are also obtained for the test dataset of MSRP Corpus.
 The Accuracy Score and Per-Class F1-Score for Positive Class of the soft voting
classifier are together taken as the final performance score for the proposed
Supervised Machine Learning Classification Paraphrase Detection System (see
the sketch below).
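The sketch referenced above, mirroring the Evaluation Module on synthetic placeholder data: cross-validation accuracy on the training split, then Accuracy and Per-Class F1 Score for Positive Class on the test split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the MSRP feature matrices and labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)
X_test, y_test = rng.normal(size=(40, 5)), rng.integers(0, 2, 40)

clf = LogisticRegression()
cv_accuracy = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy").mean()

clf.fit(X_train, y_train)
pred = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, pred)
f1_positive = f1_score(y_test, pred, pos_label=1)  # per-class F1 for the positive class
print(cv_accuracy, test_accuracy, f1_positive)
```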
SYSTEM ARCHITECTURE

TEXT PREPROCESSING → FEATURE ENGINEERING → CLASSIFICATION → EVALUATION

The modules described previously are executed in the sequence illustrated above.
EVALUATION
• The performance of the proposed Supervised Machine Learning
Classification Paraphrase Detection System on MSRP Corpus
has been evaluated using classification evaluation metrics
namely, Accuracy and Per-class F1 Score for Positive Class.
• The formula for Accuracy is given by:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

• The formula for Per-class F1 Score for Positive Class is given by:

$$\text{Per-class F1 Score for Positive Class} = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP (True Positive), TN (True Negative), FP (False Positive),
and FN (False Negative) are the entries of the confusion matrix.
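The two formulas computed directly from confusion-matrix counts; the TP, TN, FP, and FN values below are arbitrary illustrative numbers, not results from the experiments.

```python
tp, tn, fp, fn = 700, 300, 150, 125  # illustrative confusion-matrix counts

accuracy = (tp + tn) / (tp + fp + tn + fn)
f1_positive = 2 * tp / (2 * tp + fp + fn)
print(f"Accuracy = {accuracy:.4f}, Per-class F1 (positive) = {f1_positive:.4f}")
```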
EVALUATION (CONTINUED)
The comparison of the performance of the Proposed Supervised Machine Learning
Classification Paraphrase Detection System with existing paraphrase detection
systems on the test dataset of the MSRP Corpus is given in the table below:

Sl. No. | Paraphrase Detection System | Accuracy | Per-Class F1 Score for Positive Class
1 | Qiu, Kan, & Chua (2006) | 72% | 81.6%
2 | Kozareva & Montoyo (2006) | 76.6% | 79.6%
3 | Ul-Qayyum & Altaf (2012) | 74.7% | 81.8%
4 | Finch, Hwang, & Sumita (2005) | 75.0% | 82.7%
5 | Wan, Dras, Dale, & Paris (2006) | 75.6% | 83%
6 | Madnani, Tetreault, & Chodorow (2012) | 77.4% | 84.1%
7 | Das & Smith (2009) | 76.1% | 82.7%
8 | Ji & Eisenstein (2013) | 80.4% | 86.0%
9 | Filice, Da San Martino, & Moschitti (2015) | 79.1% | 85.2%
10 | Proposed Supervised Machine Learning Classification Paraphrase Detection System | 77.1% | 83.7%
EVALUATION (CONTINUED)
• Paraphrase detection methods utilizing deep learning classifiers
are not included for comparison since the proposed paraphrase
detection method does not utilize any deep learning classifier.
• The proposed Supervised Machine Learning Classification
Paraphrase Detection System has achieved an Accuracy Score of
77.10% and a Per-Class F1 Score for Positive Class of 83.71% on
the test dataset of the MSRP Corpus.
• The proposed Supervised Machine Learning Classification
Paraphrase Detection System has achieved better performance
than 6 of the 9 compared paraphrase detection systems, namely
Qiu, Kan, & Chua (2006), Kozareva & Montoyo (2006), Ul-Qayyum
& Altaf (2012), Finch, Hwang, & Sumita (2005), Wan, Dras, Dale, &
Paris (2006), and Das & Smith (2009). The remaining 3 systems,
namely Madnani, Tetreault, & Chodorow (2012), Ji & Eisenstein
(2013), and Filice, Da San Martino, & Moschitti (2015), report better
performance, as they use some advanced features that could not
be replicated or improved upon in this work.
CONCLUSION AND FUTURE WORK
• The proposed Supervised Machine Learning Classification Paraphrase Detection System
has achieved comparable performance with existing paraphrase detection systems.
• The major contributions of this project are as follows:
 Utilization of a unique combination of lexical, semantic, and syntactic features in the
proposed Supervised Machine Learning Classification Paraphrase Detection System
which, to the best of my knowledge, has not been explored previously in the same
feature-set.
 Utilization of SHAP Feature Importance Plots in XGBoost as a feature selection
technique across the various experiments to remove the features with zero
contribution, thereby improving the performance of the proposed Supervised
Machine Learning Classification Paraphrase Detection System. To the best of my
knowledge, none of the referenced research papers use SHAP Feature Importance
Plots, either for illustrating the contribution of various features or as a feature
selection technique.
 Application of Data Augmentation Techniques (Wei & Zou, 2019) has reduced the
performance of the machine learning classifiers, contrary to the findings of the
authors, who demonstrated improved text classification performance by applying
such techniques only to deep learning classifiers, namely Convolutional Neural
Networks and Recurrent Neural Networks, and not to machine learning classifiers.
CONCLUSION AND FUTURE WORK (CONTINUED)
 The soft voting classifier composed of the standalone classifiers Logistic
Regression, Random Forest, and Support Vector Machine has performed slightly
better than each of the previously mentioned standalone classifiers in the
proposed Supervised Machine Learning Classification Paraphrase Detection
System. This finding is in agreement with the results of Kozareva & Montoyo
(2006).
• For future work, more advanced lexical, semantic, and syntactic features can be
added to the feature-set of the proposed Supervised Machine Learning Classification
Paraphrase Detection System, in addition to the ones already utilized, for better
performance in paraphrase detection.
REFERENCES
1. Cer, D., Yang, Y., Kong, S. Y., Hua, N., Limtiaco, N., John, R. S., Constant, N.,
Céspedes, M. G., Yuan, S., Tar, C., Sung, Y. H., Strope, B., & Kurzweil, R. (2018).
Universal sentence encoder. arXiv preprint arXiv:1803.11175.
2. Chen, B., & Cherry, C. (2014, June). A systematic comparison of smoothing
techniques for sentence-level bleu. In Proceedings of the Ninth Workshop on
Statistical Machine Translation (pp. 362-367).
3. Chitra, A., & Rajkumar, A. (2013). Genetic algorithm based feature selection for
paraphrase recognition. International Journal on Artificial Intelligence Tools, 22(02),
1350007.
4. Chitra, A., & Rajkumar, A. (2016). Plagiarism detection using machine learning-
based paraphrase recognizer. Journal of Intelligent Systems, 25(3), 351-359.
5. Das, D., & Smith, N. A. (2009, August). Paraphrase identification as probabilistic
quasi-synchronous recognition. In Proceedings of the Joint Conference of the 47th
Annual Meeting of the ACL and the 4th International Joint Conference on Natural
Language Processing of the AFNLP: Volume 1-Volume 1 (pp. 468-476). Association
for Computational Linguistics.
6. Dolan, B., Quirk, C., & Brockett, C. (2004a). Unsupervised construction of large
paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of
the 20th International Conference on Computational Linguistics.
doi:10.3115/1220355.1220406.
REFERENCES (CONTINUED)
7. El Desouki, M. I., & Gomaa, W. H. (2019). Exploring the recent trends of
paraphrase detection. International Journal of Computer Applications (0975-8887).
8. Filice, S., Da San Martino, G., & Moschitti, A. (2015, July). Structural
representations for learning relations between pairs of texts. In Proceedings of the
53rd Annual Meeting of the Association for Computational Linguistics and the 7th
International Joint Conference on Natural Language Processing (Volume 1: Long
Papers) (pp. 1003-1013).
9. Finch, A., Hwang, Y. S., & Sumita, E. (2005). Using machine translation evaluation
techniques to determine sentence-level semantic equivalence. In Proceedings of
the third international workshop on paraphrasing (IWP2005).
10. Ji, Y., & Eisenstein, J. (2013, October). Discriminative improvements to distributional
sentence similarity. In Proceedings of the 2013 Conference on Empirical Methods in
Natural Language Processing (pp. 891-896).
11. Kozareva, Z., & Montoyo, A. (2006, August). Paraphrase identification on the basis
of supervised machine learning techniques. In International Conference on Natural
Language Processing (pp. 524-533). Springer, Berlin, Heidelberg.
12. Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015, June). From word
embeddings to document distances. In International conference on machine
learning (pp. 957-966).
REFERENCES (CONTINUED)
13. Lesk, M. (1986, June). Automatic sense disambiguation using machine readable
dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the
5th annual international conference on Systems documentation (pp. 24-26).
14. Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model
predictions. In Advances in neural information processing systems (pp. 4765-4774).
15. Madnani, N., Tetreault, J., & Chodorow, M. (2012, June). Re-examining machine
translation metrics for paraphrase identification. In Proceedings of the 2012
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (pp. 182-190). Association for
Computational Linguistics.
16. Meng, L., Huang, R., & Gu, J. (2013). A review of semantic similarity measures in
wordnet. International Journal of Hybrid Information Technology, 6(1), 1-12.
17. Qiu, L., Kan, M. Y., & Chua, T. S. (2006, July). Paraphrase recognition via
dissimilarity significance classification. In Proceedings of the 2006 Conference on
Empirical Methods in Natural Language Processing (pp. 18-26). Association for
Computational Linguistics.
18. Quirk, C., Brockett, C., & Dolan, W. (2004b). Monolingual machine translation for
paraphrase generation. In Proceedings of the 2004 Conference on Empirical
Methods in Natural Language Processing (EMNLP 2004) (pp. 142-149).
REFERENCES (CONTINUED)
19. Ul-Qayyum, Z., & Altaf, W. (2012). Paraphrase identification using semantic
heuristic features. Research Journal of Applied Sciences, Engineering and
Technology, 4(22), 4894-4904.
20. Uribe, D. (2009, November). Effectively using monotonicity analysis for paraphrase
identification. In 2009 Eighth Mexican International Conference on Artificial
Intelligence (pp. 108-113). IEEE.
21. Uysal, A. K., & Gunal, S. (2014). The impact of preprocessing on text classification.
Information Processing & Management, 50(1), 104-112.
22. Wan, S., Dras, M., Dale, R., & Paris, C. (2006). Using dependency-based features
to take the "para-farce" out of paraphrase. In Proceedings of the Australasian
Language Technology Workshop (pp. 131-138).
23. Wei, J., & Zou, K. (2019). EDA: Easy data augmentation techniques for boosting
performance on text classification tasks. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP).
ACKNOWLEDGEMENT
I am highly indebted to my Guide Dr. Bharadwaja Kumar. G for his
guidance, feedback, and support for the project.
I would also like to express my sincere gratitude to the Head of
the Department, Dr. Justus, and the Capstone Project Coordinator,
Dr. B V A N S S Prabhakar Rao, for their support and
encouragement to complete the project.
I am also grateful to my parents for their support and
encouragement during the completion of the project.
THANK YOU

More Related Content

What's hot

USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...cseij
 
Mood classification of songs based on lyrics
Mood classification of songs based on lyricsMood classification of songs based on lyrics
Mood classification of songs based on lyricsFrancesco Cucari
 
Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11Dhabal Sethi
 
A comparative study on term weighting methods for automated telugu text categ...
A comparative study on term weighting methods for automated telugu text categ...A comparative study on term weighting methods for automated telugu text categ...
A comparative study on term weighting methods for automated telugu text categ...IJDKP
 
Efficiently finding the parameter for emergign pattern based classifier
Efficiently finding the parameter for emergign pattern based classifierEfficiently finding the parameter for emergign pattern based classifier
Efficiently finding the parameter for emergign pattern based classifierGanesh Pavan Kambhampati
 
Implementation of Urdu Probabilistic Parser
Implementation of Urdu Probabilistic ParserImplementation of Urdu Probabilistic Parser
Implementation of Urdu Probabilistic ParserWaqas Tariq
 
Evaluating query-independent object features for relevancy prediction
Evaluating query-independent object features for relevancy predictionEvaluating query-independent object features for relevancy prediction
Evaluating query-independent object features for relevancy predictionNTNU
 
IRJET - Analysis of Paraphrase Detection using NLP Techniques
IRJET - Analysis of Paraphrase Detection using NLP TechniquesIRJET - Analysis of Paraphrase Detection using NLP Techniques
IRJET - Analysis of Paraphrase Detection using NLP TechniquesIRJET Journal
 
A Path-Oriented Automatic Random Testing Based on Double Constraint Propagation
A Path-Oriented Automatic Random Testing Based on Double Constraint PropagationA Path-Oriented Automatic Random Testing Based on Double Constraint Propagation
A Path-Oriented Automatic Random Testing Based on Double Constraint Propagationijseajournal
 
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...ijsrd.com
 
Classification improvement of spoken arabic language based on radial basis fu...
Classification improvement of spoken arabic language based on radial basis fu...Classification improvement of spoken arabic language based on radial basis fu...
Classification improvement of spoken arabic language based on radial basis fu...IJECEIAES
 
Capital market applications of neural networks etc
Capital market applications of neural networks etcCapital market applications of neural networks etc
Capital market applications of neural networks etc23tino
 

What's hot (18)

228-SE3001_2
228-SE3001_2228-SE3001_2
228-SE3001_2
 
D017422528
D017422528D017422528
D017422528
 
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
 
Mood classification of songs based on lyrics
Mood classification of songs based on lyricsMood classification of songs based on lyrics
Mood classification of songs based on lyrics
 
combination
combinationcombination
combination
 
Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11
 
A comparative study on term weighting methods for automated telugu text categ...
A comparative study on term weighting methods for automated telugu text categ...A comparative study on term weighting methods for automated telugu text categ...
A comparative study on term weighting methods for automated telugu text categ...
 
Efficiently finding the parameter for emergign pattern based classifier
Efficiently finding the parameter for emergign pattern based classifierEfficiently finding the parameter for emergign pattern based classifier
Efficiently finding the parameter for emergign pattern based classifier
 
Implementation of Urdu Probabilistic Parser
Implementation of Urdu Probabilistic ParserImplementation of Urdu Probabilistic Parser
Implementation of Urdu Probabilistic Parser
 
Evaluating query-independent object features for relevancy prediction
Evaluating query-independent object features for relevancy predictionEvaluating query-independent object features for relevancy prediction
Evaluating query-independent object features for relevancy prediction
 
IRJET - Analysis of Paraphrase Detection using NLP Techniques
IRJET - Analysis of Paraphrase Detection using NLP TechniquesIRJET - Analysis of Paraphrase Detection using NLP Techniques
IRJET - Analysis of Paraphrase Detection using NLP Techniques
 
A Path-Oriented Automatic Random Testing Based on Double Constraint Propagation
A Path-Oriented Automatic Random Testing Based on Double Constraint PropagationA Path-Oriented Automatic Random Testing Based on Double Constraint Propagation
A Path-Oriented Automatic Random Testing Based on Double Constraint Propagation
 
D04422730
D04422730D04422730
D04422730
 
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...
 
L1803058388
L1803058388L1803058388
L1803058388
 
Classification improvement of spoken arabic language based on radial basis fu...
Classification improvement of spoken arabic language based on radial basis fu...Classification improvement of spoken arabic language based on radial basis fu...
Classification improvement of spoken arabic language based on radial basis fu...
 
Capital market applications of neural networks etc
Capital market applications of neural networks etcCapital market applications of neural networks etc
Capital market applications of neural networks etc
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 

Similar to A Novel Approach for Developing Paraphrase Detection System using Machine Learning

A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSijnlc
 
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...IEEEFINALYEARSTUDENTPROJECTS
 
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...IEEEMEMTECHSTUDENTSPROJECTS
 
A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...IEEEFINALYEARPROJECTS
 
JAVA 2013 IEEE DATAMINING PROJECT A fast clustering based feature subset sele...
JAVA 2013 IEEE DATAMINING PROJECT A fast clustering based feature subset sele...JAVA 2013 IEEE DATAMINING PROJECT A fast clustering based feature subset sele...
JAVA 2013 IEEE DATAMINING PROJECT A fast clustering based feature subset sele...IEEEGLOBALSOFTTECHNOLOGIES
 
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXTAN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXTIJDKP
 
Cloudsim a fast clustering-based feature subset selection algorithm for high...
Cloudsim  a fast clustering-based feature subset selection algorithm for high...Cloudsim  a fast clustering-based feature subset selection algorithm for high...
Cloudsim a fast clustering-based feature subset selection algorithm for high...ecway
 
A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...ecway
 
Android a fast clustering-based feature subset selection algorithm for high-...
Android  a fast clustering-based feature subset selection algorithm for high-...Android  a fast clustering-based feature subset selection algorithm for high-...
Android a fast clustering-based feature subset selection algorithm for high-...ecway
 
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subse...
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subse...DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subse...
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subse...IEEEGLOBALSOFTTECHNOLOGIES
 
JAVA 2013 IEEE PROJECT A fast clustering based feature subset selection algor...
JAVA 2013 IEEE PROJECT A fast clustering based feature subset selection algor...JAVA 2013 IEEE PROJECT A fast clustering based feature subset selection algor...
JAVA 2013 IEEE PROJECT A fast clustering based feature subset selection algor...IEEEGLOBALSOFTTECHNOLOGIES
 
A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...IEEEFINALYEARPROJECTS
 
JAVA 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subset ...
JAVA 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subset ...JAVA 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subset ...
JAVA 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subset ...IEEEGLOBALSOFTTECHNOLOGIES
 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSkevig
 
Query expansion using novel use case scenario relationship for finding featur...
Query expansion using novel use case scenario relationship for finding featur...Query expansion using novel use case scenario relationship for finding featur...
Query expansion using novel use case scenario relationship for finding featur...IJECEIAES
 
Natural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine LearningNatural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine Learningcsandit
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniquesiosrjce
 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSkevig
 

Similar to A Novel Approach for Developing Paraphrase Detection System using Machine Learning (20)

A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
 
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...
IEEE 2014 JAVA DATA MINING PROJECTS A fast clustering based feature subset se...
 
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...
2014 IEEE JAVA DATA MINING PROJECT A fast clustering based feature subset sel...
 
A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...
 
JAVA 2013 IEEE DATAMINING PROJECT A fast clustering based feature subset sele...
JAVA 2013 IEEE DATAMINING PROJECT A fast clustering based feature subset sele...JAVA 2013 IEEE DATAMINING PROJECT A fast clustering based feature subset sele...
JAVA 2013 IEEE DATAMINING PROJECT A fast clustering based feature subset sele...
 
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXTAN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
 
Cloudsim a fast clustering-based feature subset selection algorithm for high...
Cloudsim  a fast clustering-based feature subset selection algorithm for high...Cloudsim  a fast clustering-based feature subset selection algorithm for high...
Cloudsim a fast clustering-based feature subset selection algorithm for high...
 
A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...
 
Android a fast clustering-based feature subset selection algorithm for high-...
Android  a fast clustering-based feature subset selection algorithm for high-...Android  a fast clustering-based feature subset selection algorithm for high-...
Android a fast clustering-based feature subset selection algorithm for high-...
 
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subse...
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subse...DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subse...
DOTNET 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subse...
 
JAVA 2013 IEEE PROJECT A fast clustering based feature subset selection algor...
JAVA 2013 IEEE PROJECT A fast clustering based feature subset selection algor...JAVA 2013 IEEE PROJECT A fast clustering based feature subset selection algor...
JAVA 2013 IEEE PROJECT A fast clustering based feature subset selection algor...
 
A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...
 
JAVA 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subset ...
JAVA 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subset ...JAVA 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subset ...
JAVA 2013 IEEE CLOUDCOMPUTING PROJECT A fast clustering based feature subset ...
 
Research proposal
Research proposalResearch proposal
Research proposal
 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
 
Query expansion using novel use case scenario relationship for finding featur...
Query expansion using novel use case scenario relationship for finding featur...Query expansion using novel use case scenario relationship for finding featur...
Query expansion using novel use case scenario relationship for finding featur...
 
Natural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine LearningNatural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine Learning
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniques
 
D017232729
D017232729D017232729
D017232729
 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
 

Recently uploaded

Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...ssuserf63bd7
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"John Sobanski
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证dq9vz1isj
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Klinik Aborsi
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunksgmuir1066
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...yulianti213969
 
Formulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfFormulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfRobertoOcampo24
 
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单aqpto5bt
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsBrainSell Technologies
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxStephen266013
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Valters Lauzums
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token PredictionNABLAS株式会社
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjadimosmejiaslendon
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证zifhagzkk
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证acoha1
 
原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证pwgnohujw
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样jk0tkvfv
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...yulianti213969
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives23050636
 

Recently uploaded (20)

Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
 
Formulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfFormulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdf
 
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
 
原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
 

A Novel Approach for Developing Paraphrase Detection System using Machine Learning

  • 1. A NOVEL APPROACH FOR DEVELOPING PARAPHRASE DETECTION SYSTEM USING MACHINE LEARNING RUDRADITYO SAHA (16BCE1062) GUIDE: Dr. BHARADWAJA KUMAR. G
  • 2. INTRODUCTION • The advent of Internet has made easy and instant access to vast amounts of information possible. • The downside to this is that this has led to significant increase in incidents of plagiarism. • Detecting plagiarism is a difficult task since there can be numerous major or minor changes made to a sentence at several levels, namely, lexical, semantic, and syntactic level, to construct a paraphrased or plagiarized sentence posing as original. 2
  • 3. LITERATURE REVIEW • An in-depth Literature Review has been carried out in the domains of paraphrase detection systems, and performance enhancement techniques for paraphrase detection (23 research papers have been referenced). • 11 paraphrase detection systems have been explored in Literature Review of which some of the major ones have been briefly described below:  Kozareva & Montoyo (2006) have used a voting classifier composed of 3 individual classifiers namely, support vector machine, k-nearest neighbors and logistic regression for performing paraphrase detection using various lexical and semantic features. Some features utilized are ratio between common consecutive n-grams between two sentences and total number of words in the two sentences, Longest Common Subsequence and a WordNet based feature.  Finch, Hwang, & Sumita (2005) have used machine translation evaluation techniques such as Bilingual Evaluation Understudy (BLEU) and Word Error Rate (WER) as features to be used as input to a support vector machine classifier for paraphrase detection. 3
  • 4. LITERATURE REVIEW (CONTINUED)  Wan, Dras, Dale, & Paris (2006) have used various lexical features such as n-gram based features, BLEU based features and syntactic features such as dependency relation based features. A support vector machine classifier has been used to classify a sentence pair into either a paraphrastic pair or a non-paraphrastic pair with the help of such features.  Chitra & Rajkumar (2013) have developed a support vector machine based paraphrase recognition system with the help of various lexical, syntactic and semantic features to detect paraphrases.  Some features utilized are Longest Common Subsequence, BLEU based features, skipgram based features, and Dependency Relation Overlap.  A feature selection technique namely, genetic algorithm is utilized to not only improve accuracy of suggested paraphrase recognition method but to also reduce number of features.  The study carried out by (El Desouki & Gomaa, 2019) provides a comprehensive discussion on the recent research carried out on paraphrase detection methods categorized broadly under three categories namely, unsupervised learning, supervised machine learning and supervised deep learning. 4
  • 5. LITERATURE REVIEW (CONTINUED)  Since the proposed paraphrase detection method is a machine learning- based paraphrase detection method, comparison will be done with machine learning based paraphrase detection methods mentioned by El Desouki & Gomaa (2019). • Literature Review is also carried out to explore performance enhancement techniques for paraphrase detection methods which are as below:  The study carried out by (Uysal & Gunal, 2014) explores the impact of commonly used preprocessing tasks such as removing stop-words, lowercasing and stemming in all possible combinations in two languages, namely Turkish and English, on two different domains, specifically news and e-mails for text classification by keeping in consideration aspects such as accuracy and dimension size.  The study concludes that text preprocessing in text classification is as important as feature selection for improving performance of classifier, some text preprocessing tasks such as lowercasing improve classification performance regardless of domain and language, all possible combinations of preprocessing tasks must be tried for finding out the best combination, and removing stop words leads to decrease in performance of classifier. 5
LITERATURE REVIEW (CONTINUED)
 Wei and Zou (2019) have presented Easy Data Augmentation (EDA) techniques, namely random insertion, random swap, random deletion, and synonym replacement, for improving text classification performance by augmenting the training dataset of a corpus with additional textual data.
 Wei and Zou (2019) have shown that EDA enhances the classification performance of deep learning classifiers such as convolutional neural networks.
 EDA is shown to perform particularly well on small datasets; a minimal sketch of the four operations follows.
6
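For concreteness, below is a minimal Python sketch of the four EDA operations, assuming NLTK's WordNet is available for synonym lookup; the function names and parameter defaults are illustrative, not the authors' reference implementation.

    import random
    from nltk.corpus import wordnet  # requires nltk.download('wordnet')

    def synonym_replacement(words, n=1):
        # Replace up to n words that have a WordNet synonym.
        new_words = words.copy()
        candidates = [w for w in words if wordnet.synsets(w)]
        random.shuffle(candidates)
        replaced = 0
        for word in candidates:
            synonyms = {l.name().replace('_', ' ')
                        for s in wordnet.synsets(word)
                        for l in s.lemmas()} - {word}
            if synonyms:
                choice = random.choice(sorted(synonyms))
                new_words = [choice if w == word else w for w in new_words]
                replaced += 1
            if replaced >= n:
                break
        return new_words

    def random_insertion(words, n=1):
        # Insert a synonym of a randomly chosen word at a random position, n times.
        new_words = words.copy()
        for _ in range(n):
            for _ in range(10):  # retry until a word with a synset is drawn
                word = random.choice(new_words)
                synsets = wordnet.synsets(word)
                if synsets:
                    synonym = random.choice(synsets[0].lemma_names()).replace('_', ' ')
                    new_words.insert(random.randint(0, len(new_words)), synonym)
                    break
        return new_words

    def random_swap(words, n=1):
        # Swap the words at two randomly chosen positions, n times.
        new_words = words.copy()
        if len(new_words) < 2:
            return new_words
        for _ in range(n):
            i, j = random.sample(range(len(new_words)), 2)
            new_words[i], new_words[j] = new_words[j], new_words[i]
        return new_words

    def random_deletion(words, p=0.1):
        # Delete each word independently with probability p,
        # always keeping at least one word.
        kept = [w for w in words if random.random() > p]
        return kept if kept else [random.choice(words)]

Applying any one operation to a tokenized sentence, e.g. random_swap("the sentences are paraphrases".split()), yields an augmented copy that can be added to the training dataset alongside the original.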
RESEARCH GAP
• Some combinations of lexical, semantic, and syntactic features have, to the best of my knowledge, not been explored in the same feature-set. For example, semantic features such as Word2Vec Embeddings based Features, Sentence Embeddings based Features, and Features based on WordNet Similarity Measures have not been explored together in the same feature-set.
• Data Augmentation Techniques (Wei & Zou, 2019) have, to the best of my knowledge, not been explored to improve the performance of machine learning classifiers for paraphrase detection.
• Feature Selection Techniques based on Feature Importance Plots given by SHAP (Lundberg & Lee, 2017) have, to the best of my knowledge, not been explored to improve paraphrase detection performance.
7
OBJECTIVE
• To develop a novel Supervised Machine Learning Classification Paraphrase Detection System with paraphrase detection performance comparable to that of existing paraphrase detection systems.
• To improve the performance of the proposed Supervised Machine Learning Classification Paraphrase Detection System by taking the following steps:
 Use of a unique combination of lexical, semantic, and syntactic features.
 Utilization of SHAP Feature Importance Plots in XGBoost as a feature selection technique to remove features with zero contribution.
 Application of Data Augmentation Techniques (Wei & Zou, 2019) on the training dataset.
 Use of a soft voting classifier comprising the top 3 performing standalone classifiers on the training dataset.
8
RESEARCH METHODOLOGY
• For developing a Supervised Machine Learning Classification Paraphrase Detection System, 5 experiments are conducted using the Microsoft Research Paraphrase Corpus (MSRP Corpus) (Dolan, Quirk, & Brockett, 2004a; Quirk, Brockett, & Dolan, 2004b).
• The MSRP Corpus consists of 5801 sentence pairs, of which 4076 are in the training dataset and 1725 are in the test dataset.
• Since paraphrase detection is a binary classification problem, each sentence pair is assigned a quality label of "1" or "0", with "1" indicating that the sentences in the pair are paraphrases and "0" indicating that they are not.
• In Experiments 1, 2, and 3, the performance of various standalone classifiers and a soft voting classifier is assessed with the help of lexical, semantic, and syntactic features, respectively.
• In Experiment 4, a feature-set composed of lexical, semantic, and syntactic features is utilized for evaluating the performance of various standalone classifiers and a soft voting classifier.
• In Experiment 5, Data Augmentation Techniques (Wei & Zou, 2019) are applied on the training dataset of the MSRP Corpus to improve the performance of various standalone classifiers and a soft voting classifier using the feature-set of Experiment 4.
9
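As a minimal sketch of the data setup, the publicly distributed MSRP files can be loaded with pandas; the file names below are those of the public distribution and are an assumption here, not taken from the project code.

    import csv
    import pandas as pd

    # The distributed MSRP files are tab-separated with columns:
    # Quality, #1 ID, #2 ID, #1 String, #2 String.
    train = pd.read_csv('msr_paraphrase_train.txt', sep='\t',
                        quoting=csv.QUOTE_NONE, encoding='utf-8')
    test = pd.read_csv('msr_paraphrase_test.txt', sep='\t',
                       quoting=csv.QUOTE_NONE, encoding='utf-8')

    print(len(train), len(test))   # expected: 4076 and 1725 sentence pairs
    y_train = train['Quality']     # 1 = paraphrastic pair, 0 = non-paraphrastic pair
    sentence_pairs = train[['#1 String', '#2 String']]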
RESEARCH METHODOLOGY (CONTINUED)
• From Experiments 1, 2, and 3, the features found to have zero contribution with the help of the SHAP Feature Importance Plot in XGBoost are excluded from the feature-set utilized in Experiment 4.
• A SHAP Feature Importance Plot lists the features of the feature-set in descending order of importance, where a feature's importance is its mean |SHAP value| as computed by the Tree SHAP algorithm (Lundberg & Lee, 2017).
• The SHAP Feature Importance Plots in XGBoost generated in Experiments 1, 2, and 3 are given in the subsequent slides.
10
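A minimal sketch of this feature-selection step, assuming the shap and xgboost packages; the synthetic data stands in for an experiment's engineered feature matrix and is not the project's actual data.

    import numpy as np
    import pandas as pd
    import shap
    import xgboost as xgb
    from sklearn.datasets import make_classification

    # Synthetic stand-in for the engineered feature matrix of an experiment.
    X, y = make_classification(n_samples=300, n_features=8, random_state=0)
    X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(8)])

    model = xgb.XGBClassifier().fit(X, y)

    # Tree SHAP attribution of every prediction to every feature.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    # Bar plot of mean |SHAP value| per feature, in descending order.
    shap.summary_plot(shap_values, X, plot_type='bar')

    # Feature selection: keep only features with non-zero mean |SHAP value|.
    mean_abs_shap = np.abs(shap_values).mean(axis=0)
    X_reduced = X.loc[:, mean_abs_shap > 0]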
SHAP FEATURE IMPORTANCE PLOT IN XGBOOST IN EXPERIMENT 1
[Figure: mean |SHAP value| bar plot] Total: 22 Features, Dropped: 1 Feature
11
SHAP FEATURE IMPORTANCE PLOT IN XGBOOST IN EXPERIMENT 2
[Figure: mean |SHAP value| bar plot] Total: 14 Features, Dropped: 3 Features
12
SHAP FEATURE IMPORTANCE PLOT IN XGBOOST IN EXPERIMENT 3
[Figure: mean |SHAP value| bar plot] Total: 3 Features, Dropped: 0 Features
13
FEATURES PRESENT IN FEATURE-SET IN EXPERIMENT 4
Lexical Features:
1. Fuzz Ratio
2. Fuzz Partial Ratio
3. Fuzz Partial Token Sort Ratio
4. Fuzz Token Set Ratio
5. Fuzz Token Sort Ratio
6. Common Unigram Ratio
7. Jaccard Similarity with Unigram
8. Common Bigram Ratio
9. Jaccard Similarity with Bigram
10. Common Trigram Ratio
11. Jaccard Similarity with Trigram
12. Common 1-skip-2-gram Ratio
13. Jaccard Similarity with 1-skip-2-gram
14. Common 2-skip-2-gram Ratio
15. Jaccard Similarity with 2-skip-2-gram
16. BLEU-1
17. BLEU-2
18. BLEU-3
19. BLEU-4
20. Absolute Length Difference
21. Normalized Longest Common Subsequence
Semantic Features:
22. Normalized Word Mover's Distance
23. Minkowski Distance
24. Cosine Distance
25. Canberra Distance
26. City Block Distance
27. Braycurtis Distance
28. Wu and Palmer's Measure
29. Shortest Path based Measure
30. Leacock and Chodorow's Measure
31. Jiang and Conrath's Measure
32. Lin's Measure
Syntactic Features:
33. Dependency Relation Overlap Precision
34. Dependency Relation Overlap Recall
35. Dependency Relation Overlap F-measure
14
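A minimal sketch of how a few representative features from the lexical and semantic categories might be computed, assuming the fuzzywuzzy, NLTK, and SciPy packages; the exact implementation used in the project may differ, and the stand-in vectors are illustrative.

    from fuzzywuzzy import fuzz
    from nltk.corpus import wordnet
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from scipy.spatial import distance

    s1 = "the cat sat on the mat".split()
    s2 = "a cat was sitting on the mat".split()

    # Lexical: fuzzy string ratio and unigram Jaccard similarity.
    fuzz_ratio = fuzz.ratio(' '.join(s1), ' '.join(s2))
    jaccard_unigram = len(set(s1) & set(s2)) / len(set(s1) | set(s2))

    # Lexical: smoothed BLEU-1 (Chen & Cherry, 2014), with s1 as reference.
    bleu1 = sentence_bleu([s1], s2, weights=(1, 0, 0, 0),
                          smoothing_function=SmoothingFunction().method1)

    # Semantic: distances between sentence vectors; the real features would
    # use Word2Vec / sentence embeddings, stand-in vectors are used here.
    v1, v2 = [0.1, 0.3, 0.6], [0.2, 0.2, 0.6]
    cosine_dist = distance.cosine(v1, v2)
    cityblock_dist = distance.cityblock(v1, v2)

    # Semantic: a WordNet similarity measure (Wu & Palmer) between two senses.
    wup = wordnet.synsets('cat')[0].wup_similarity(wordnet.synsets('mat')[0])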
PERFORMANCE OF VARIOUS STANDALONE CLASSIFIERS ON TRAINING DATASET OF MSRP CORPUS IN EXPERIMENT 4
Sl. No. | Classifier | Cross-Validation Accuracy
1 | Logistic Regression | 76.13%
2 | Random Forest | 75.96%
3 | Support Vector Machine | 75.64%
4 | Gradient Boosting | 75.54%
5 | XGBoost | 75.22%
6 | K Nearest Neighbors | 73.68%
7 | AdaBoost | 73.60%
8 | Baseline | 67.54%
9 | LightGBM | 59.95%
15
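A minimal sketch of how such a ranking might be produced with scikit-learn; the synthetic data stands in for the Experiment 4 feature matrix, and the classifier settings are illustrative rather than the project's tuned configuration.

    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Synthetic stand-in for the Experiment 4 feature matrix and labels.
    X_train, y_train = make_classification(n_samples=400, n_features=10,
                                           random_state=0)

    classifiers = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Random Forest': RandomForestClassifier(random_state=0),
        'Support Vector Machine': SVC(probability=True, random_state=0),
        'Baseline': DummyClassifier(strategy='most_frequent'),
    }

    # Mean cross-validation accuracy on the training data, highest first.
    scores = {name: cross_val_score(clf, X_train, y_train, cv=5,
                                    scoring='accuracy').mean()
              for name, clf in classifiers.items()}
    for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f'{name}: {acc:.2%}')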
PERFORMANCE OF VARIOUS STANDALONE CLASSIFIERS ON TEST DATASET OF MSRP CORPUS IN EXPERIMENT 4
Sl. No. | Classifier | Accuracy | Per-Class F1 Score for Positive Class
1 | Logistic Regression | 76.87% | 83.40%
2 | Support Vector Machine | 76.46% | 83.31%
3 | LightGBM | 76.00% | 82.12%
4 | XGBoost | 75.94% | 82.90%
5 | Gradient Boosting | 75.77% | 82.83%
6 | Random Forest | 75.65% | 82.69%
7 | AdaBoost | 74.55% | 81.99%
8 | K Nearest Neighbors | 73.51% | 80.87%
9 | Baseline | 66.49% | 79.87%
16
COMPARISON OF VOTING CLASSIFIER WITH TOP 3 STANDALONE CLASSIFIERS ON TEST DATASET OF MSRP CORPUS IN EXPERIMENT 4
Sl. No. | Classifier | Accuracy | Precision (0) | Precision (1) | Recall (0) | Recall (1) | F1 (0) | F1 (1)
1 | Logistic Regression | 76.87% | 69.08% | 79.78% | 56.06% | 87.36% | 61.89% | 83.40%
2 | Random Forest | 75.65% | 67.71% | 78.42% | 52.25% | 87.45% | 58.98% | 82.69%
3 | Support Vector Machine | 76.46% | 69.55% | 78.83% | 52.94% | 88.32% | 60.12% | 83.31%
4 | Voting (Logistic Regression, Random Forest, Support Vector Machine) | 77.10% | 70.47% | 79.42% | 54.50% | 88.49% | 61.46% | 83.71%
17
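A minimal sketch of the soft voting step with scikit-learn's VotingClassifier; the synthetic data and hyperparameters are illustrative stand-ins for the Experiment 4 setup.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Synthetic stand-in for the Experiment 4 features and labels.
    X, y = make_classification(n_samples=400, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Soft voting averages the constituents' predicted class probabilities,
    # so SVC must be fitted with probability=True.
    voter = VotingClassifier(
        estimators=[('lr', LogisticRegression(max_iter=1000)),
                    ('rf', RandomForestClassifier(random_state=0)),
                    ('svm', SVC(probability=True, random_state=0))],
        voting='soft')
    voter.fit(X_train, y_train)

    y_pred = voter.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print('Per-Class F1 Score for Positive Class:',
          f1_score(y_test, y_pred, pos_label=1))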
SUMMARY OF RESULTS OF CLASSIFIERS IN VARIOUS EXPERIMENTS ON TEST DATASET OF MSRP CORPUS
Experiment | Classifier | Accuracy | Per-Class F1 Score for Positive Class
1 | Logistic Regression | 76.06% | 82.88%
1 | Voting (Support Vector Machine, Logistic Regression, Random Forest) | 76.23% | 83.29%
2 | Support Vector Machine | 72.52% | 81.16%
2 | Voting (Support Vector Machine, Random Forest, K Nearest Neighbors) | 71.42% | 80.30%
3 | Logistic Regression | 68.93% | 79.59%
3 | Voting (Gradient Boosting, Logistic Regression, XGBoost) | 69.62% | 80.17%
4 | Logistic Regression | 76.87% | 83.40%
4 | Voting (Logistic Regression, Random Forest, Support Vector Machine) | 77.10% | 83.71%
5 | Support Vector Machine | 75.94% | 83.86%
5 | Voting (Random Forest, Support Vector Machine, K Nearest Neighbors) | 74.72% | 83.01%
18
DISCUSSION
• The performance of the voting classifiers of the methods developed in Experiments 1, 2, and 3 has been improved upon by the voting classifier of the method developed in Experiment 4.
• The voting classifier of the method developed in Experiment 4 has obtained the best Accuracy Score of 77.10% and the best Per-Class F1 Score for Positive Class of 83.71%.
• The voting classifier of the method developed in Experiment 5 has obtained a lower Accuracy Score and a lower Per-Class F1 Score for Positive Class than the voting classifier of the method developed in Experiment 4.
• From this result, it can be inferred that the Data Augmentation Techniques (Wei & Zou, 2019) applied in Experiment 5 not only fail to improve but actually degrade the performance of the machine learning classifiers.
• This is a new finding, since Wei & Zou (2019) have demonstrated improvement in text classification performance by applying such data augmentation techniques only to deep learning classifiers, namely Convolutional Neural Networks and Recurrent Neural Networks, and not to machine learning classifiers.
• Hence, the method developed in Experiment 4 is selected as the proposed Supervised Machine Learning Classification Paraphrase Detection System.
• The code for the proposed Supervised Machine Learning Classification Paraphrase Detection System, run on Google Colaboratory using the Python 3 programming language, is uploaded at https://github.com/Rudradityo/Paraphrase-Detection.
19
PROPOSED SUPERVISED MACHINE LEARNING CLASSIFICATION PARAPHRASE DETECTION SYSTEM
There are 4 Modules in the proposed Supervised Machine Learning Classification Paraphrase Detection System, which are as follows:
 Text Preprocessing Module
 Various text preprocessing techniques have been explored, specifically: removing leading and trailing whitespace, lowercasing, removing punctuation marks, converting accented characters to ASCII characters, removing special characters or substituting them with their literal counterparts (such as '$' with "dollar"), removing numbers, removing stop words from a custom stop-words list, stemming or lemmatizing, and expanding contractions.
 The techniques of removing leading and trailing whitespace, lowercasing, removing punctuation marks, and converting accented characters to ASCII characters have been shown to give the best performance and are hence utilized in the proposed Supervised Machine Learning Classification Paraphrase Detection System (a minimal sketch follows).
 Feature Engineering Module
 The feature-set utilized in the proposed Supervised Machine Learning Classification Paraphrase Detection System is the same as the feature-set utilized in Experiment 4.
20
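A minimal sketch of the four retained preprocessing steps, using only the Python standard library; the function name and ordering are illustrative assumptions.

    import string
    import unicodedata

    def preprocess(text: str) -> str:
        # 1. Remove leading and trailing whitespace.
        text = text.strip()
        # 2. Lowercase.
        text = text.lower()
        # 3. Remove punctuation marks.
        text = text.translate(str.maketrans('', '', string.punctuation))
        # 4. Convert accented characters to their closest ASCII equivalents.
        text = unicodedata.normalize('NFKD', text) \
                          .encode('ascii', 'ignore').decode('ascii')
        return text

    print(preprocess('  Café prices rose by 5%!  '))  # -> 'cafe prices rose by 5'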
PROPOSED SUPERVISED MACHINE LEARNING CLASSIFICATION PARAPHRASE DETECTION SYSTEM (CONTINUED)
 Classification Module
 Various machine learning classifiers are applied and their paraphrase detection performance compared, as per the directions for future work given by Ji & Eisenstein (2013).
 The machine learning classifiers used are XGBoost, Logistic Regression, Gradient Boosting, AdaBoost, Random Forest, K Nearest Neighbors, LightGBM, and Support Vector Machine.
 A baseline majority-class classifier is used to enable comparison with the previously mentioned machine learning classifiers.
 A soft voting classifier, constructed from the top 3 performing standalone machine learning classifiers on the training dataset of the MSRP Corpus, is also utilized.
 Each constituent classifier in the soft voting classifier provides a probability value for both class labels, namely non-paraphrastic pair (0) and paraphrastic pair (1), for a sentence pair. For each class label, the probabilities given by the constituent classifiers are averaged, and the class label with the higher average probability is taken as the predicted label for the sentence pair (see the sketch below).
21
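To make the averaging rule concrete, a small sketch with invented probability values (not actual model outputs):

    import numpy as np

    # predict_proba outputs of the three constituents for one sentence pair:
    # columns are P(class 0) and P(class 1). Values are illustrative.
    probs = np.array([[0.40, 0.60],   # Logistic Regression
                      [0.55, 0.45],   # Random Forest
                      [0.30, 0.70]])  # Support Vector Machine

    avg = probs.mean(axis=0)               # -> [0.4167, 0.5833]
    predicted_label = int(np.argmax(avg))  # -> 1 (paraphrastic pair)
    print(avg, predicted_label)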
PROPOSED SUPERVISED MACHINE LEARNING CLASSIFICATION PARAPHRASE DETECTION SYSTEM (CONTINUED)
 Evaluation Module
 The Cross-Validation Accuracy Scores of the various standalone classifiers are obtained on the training dataset of the MSRP Corpus.
 The top 3 performing standalone classifiers on the training dataset of the MSRP Corpus are used to form a soft voting classifier.
 The Accuracy Scores and Per-Class F1 Scores for Positive Class of the various standalone classifiers are also obtained on the test dataset of the MSRP Corpus.
 The Accuracy Score and Per-Class F1 Score for Positive Class of the soft voting classifier are together taken as the final performance score of the proposed Supervised Machine Learning Classification Paraphrase Detection System.
22
SYSTEM ARCHITECTURE
TEXT PREPROCESSING → FEATURE ENGINEERING → CLASSIFICATION → EVALUATION
The modules described previously are illustrated above in sequence.
23
EVALUATION
• The performance of the proposed Supervised Machine Learning Classification Paraphrase Detection System on the MSRP Corpus has been evaluated using the classification evaluation metrics Accuracy and Per-Class F1 Score for Positive Class.
• The formula for Accuracy is given by:
Accuracy = (TP + TN) / (TP + FP + TN + FN)
• The formula for Per-Class F1 Score for Positive Class is given by:
Per-Class F1 Score for Positive Class = 2TP / (2TP + FP + FN)
where TP (True Positives), TN (True Negatives), FP (False Positives), and FN (False Negatives) are the entries of the confusion matrix.
24
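As a quick check of the two formulas, a worked sketch; the confusion-matrix counts below are reconstructed here from the voting classifier's reported percentages on the 1725-pair test set (a derivation for illustration, not figures taken from the source).

    # Counts consistent with the reported 77.10% accuracy and 83.71% F1.
    TP, TN, FP, FN = 1015, 315, 263, 132

    accuracy = (TP + TN) / (TP + FP + TN + FN)   # 1330 / 1725 ≈ 77.10%
    f1_positive = 2 * TP / (2 * TP + FP + FN)    # 2030 / 2425 ≈ 83.71%

    print(f'Accuracy: {accuracy:.2%}')
    print(f'Per-Class F1 Score for Positive Class: {f1_positive:.2%}')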
EVALUATION (CONTINUED)
The comparison of the performance of the Proposed Supervised Machine Learning Classification Paraphrase Detection System with existing paraphrase detection systems on the Test Dataset of the MSRP Corpus is given in the table below:
Sl. No. | Paraphrase Detection System | Accuracy | Per-Class F1 Score for Positive Class
1 | Qiu, Kan, & Chua (2006) | 72.0% | 81.6%
2 | Kozareva & Montoyo (2006) | 76.6% | 79.6%
3 | Ul-Qayyum & Altaf (2012) | 74.7% | 81.8%
4 | Finch, Hwang, & Sumita (2005) | 75.0% | 82.7%
5 | Wan, Dras, Dale, & Paris (2006) | 75.6% | 83.0%
6 | Madnani, Tetreault, & Chodorow (2012) | 77.4% | 84.1%
7 | Das & Smith (2009) | 76.1% | 82.7%
8 | Ji & Eisenstein (2013) | 80.4% | 86.0%
9 | Filice, Da San Martino, & Moschitti (2015) | 79.1% | 85.2%
10 | Proposed Supervised Machine Learning Classification Paraphrase Detection System | 77.1% | 83.7%
25
EVALUATION (CONTINUED)
• Paraphrase detection methods utilizing deep learning classifiers are not included in the comparison, since the proposed paraphrase detection method does not utilize any deep learning classifier.
• The proposed Supervised Machine Learning Classification Paraphrase Detection System has achieved an Accuracy Score of 77.10% and a Per-Class F1 Score for Positive Class of 83.71% on the test dataset of the MSRP Corpus.
• The proposed Supervised Machine Learning Classification Paraphrase Detection System has achieved better performance than 6 of the 9 paraphrase detection systems compared, namely Qiu, Kan, & Chua (2006), Kozareva & Montoyo (2006), Ul-Qayyum & Altaf (2012), Finch, Hwang, & Sumita (2005), Wan, Dras, Dale, & Paris (2006), and Das & Smith (2009). The remaining 3 systems, namely Madnani, Tetreault, & Chodorow (2012), Ji & Eisenstein (2013), and Filice, Da San Martino, & Moschitti (2015), report better performance, as they have used some advanced features that could not be replicated or improved upon here.
26
CONCLUSION AND FUTURE WORK
• The proposed Supervised Machine Learning Classification Paraphrase Detection System has achieved performance comparable to that of existing paraphrase detection systems.
• The major contributions of this project are as follows:
 Utilization of a unique combination of lexical, semantic, and syntactic features in the proposed Supervised Machine Learning Classification Paraphrase Detection System which, to the best of my knowledge, has not been explored previously in the same feature-set.
 Utilization of SHAP Feature Importance Plots in XGBoost as a feature selection technique across the various experiments to remove features with zero contribution, in order to improve the performance of the proposed Supervised Machine Learning Classification Paraphrase Detection System. In all the research papers referenced there is, to the best of my knowledge, no usage of SHAP Feature Importance Plots either for illustrating the contribution of various features or as a feature selection technique.
 The finding that Data Augmentation Techniques (Wei & Zou, 2019) reduce the performance of machine learning classifiers, contrary to the findings of the authors, who have demonstrated improvement of text classification performance by applying such data augmentation techniques only to deep learning classifiers, namely Convolutional Neural Networks and Recurrent Neural Networks, and not to machine learning classifiers.
27
CONCLUSION AND FUTURE WORK (CONTINUED)
 The soft voting classifier composed of the standalone classifiers Logistic Regression, Random Forest, and Support Vector Machine has performed slightly better than each of the previously mentioned standalone classifiers in the proposed Supervised Machine Learning Classification Paraphrase Detection System. This finding is in agreement with the results of Kozareva & Montoyo (2006).
• For future work, more advanced lexical, semantic, and syntactic features, in addition to the ones already utilized, can be added to the feature-set of the proposed Supervised Machine Learning Classification Paraphrase Detection System for better paraphrase detection performance.
28
REFERENCES
1. Cer, D., Yang, Y., Kong, S. Y., Hua, N., Limtiaco, N., John, R. S., Constant, N., Céspedes, M. G., Yuan, S., Tar, C., Sung, Y. H., Strope, B., & Kurzweil, R. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.
2. Chen, B., & Cherry, C. (2014, June). A systematic comparison of smoothing techniques for sentence-level BLEU. In Proceedings of the Ninth Workshop on Statistical Machine Translation (pp. 362-367).
3. Chitra, A., & Rajkumar, A. (2013). Genetic algorithm based feature selection for paraphrase recognition. International Journal on Artificial Intelligence Tools, 22(02), 1350007.
4. Chitra, A., & Rajkumar, A. (2016). Plagiarism detection using machine learning-based paraphrase recognizer. Journal of Intelligent Systems, 25(3), 351-359.
5. Das, D., & Smith, N. A. (2009, August). Paraphrase identification as probabilistic quasi-synchronous recognition. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 (pp. 468-476). Association for Computational Linguistics.
6. Dolan, B., Quirk, C., & Brockett, C. (2004a). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics. doi:10.3115/1220355.1220406.
29
REFERENCES (CONTINUED)
7. El Desouki, M. I., & Gomaa, W. H. (2019). Exploring the recent trends of paraphrase detection. International Journal of Computer Applications (0975-8887).
8. Filice, S., Da San Martino, G., & Moschitti, A. (2015, July). Structural representations for learning relations between pairs of texts. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 1003-1013).
9. Finch, A., Hwang, Y. S., & Sumita, E. (2005). Using machine translation evaluation techniques to determine sentence-level semantic equivalence. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
10. Ji, Y., & Eisenstein, J. (2013, October). Discriminative improvements to distributional sentence similarity. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 891-896).
11. Kozareva, Z., & Montoyo, A. (2006, August). Paraphrase identification on the basis of supervised machine learning techniques. In International Conference on Natural Language Processing (FinTAL 2006), Finland (pp. 524-533). Springer, Berlin, Heidelberg.
12. Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015, June). From word embeddings to document distances. In International Conference on Machine Learning (pp. 957-966).
30
REFERENCES (CONTINUED)
13. Lesk, M. (1986, June). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation (pp. 24-26).
14. Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (pp. 4765-4774).
15. Madnani, N., Tetreault, J., & Chodorow, M. (2012, June). Re-examining machine translation metrics for paraphrase identification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 182-190). Association for Computational Linguistics.
16. Meng, L., Huang, R., & Gu, J. (2013). A review of semantic similarity measures in WordNet. International Journal of Hybrid Information Technology, 6(1), 1-12.
17. Qiu, L., Kan, M. Y., & Chua, T. S. (2006, July). Paraphrase recognition via dissimilarity significance classification. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (pp. 18-26). Association for Computational Linguistics.
18. Quirk, C., Brockett, C., & Dolan, W. (2004b). Monolingual machine translation for paraphrase generation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004) (pp. 142-149).
31
REFERENCES (CONTINUED)
19. Ul-Qayyum, Z., & Altaf, W. (2012). Paraphrase identification using semantic heuristic features. Research Journal of Applied Sciences, Engineering and Technology, 4(22), 4894-4904.
20. Uribe, D. (2009, November). Effectively using monotonicity analysis for paraphrase identification. In 2009 Eighth Mexican International Conference on Artificial Intelligence (pp. 108-113). IEEE.
21. Uysal, A. K., & Gunal, S. (2014). The impact of preprocessing on text classification. Information Processing & Management, 50(1), 104-112.
22. Wan, S., Dras, M., Dale, R., & Paris, C. (2006). Using dependency-based features to take the "para-farce" out of paraphrase. In Proceedings of the Australasian Language Technology Workshop (pp. 131-138).
23. Wei, J., & Zou, K. (2019). EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
32
ACKNOWLEDGEMENT
I am highly indebted to my guide, Dr. Bharadwaja Kumar. G, for his guidance, feedback, and support throughout the project. I would also like to express my special gratitude to the Head of the Department, Dr. Justus, and the Capstone Project Coordinator, Dr. B V A N S S Prabhakar Rao, for their support and encouragement in completing the project. I am also grateful to my parents for their support and encouragement during the completion of the project.
33
THANK YOU