SlideShare a Scribd company logo
1 of 8
Download to read offline
International Journal of Electrical and Computer Engineering (IJECE)
Vol. 9, No. 1, February 2019, pp. 652~659
ISSN: 2088-8708, DOI: 10.11591/ijece.v9i1.pp652-659  652
Journal homepage: http://iaescore.com/journals/index.php/IJECE
The effect of training set size in authorship attribution:
application on short Arabic texts
Mohammed AL-Sarem1
, Abdel-Hamid Emara2
1Department of Information System, Taibah University, Madinah, KSA
1Department of Computer Science, Saba'a Region University, Mareb, Yemen
2Department of Computer Science, Taibah University, Madinah, KSA
2Computers and Systems Engineering Department, Al-Azhar University, Egypt
Article Info ABSTRACT
Article history:
Received Apr 14, 2018
Revised Sep 3, 2018
Accepted Sep 26, 2018
Authorship attribution (AA) is a subfield of linguistics analysis, aiming to
identify the original author among a set of candidate authors. Several
research papers were published and several methods and models were
developed for many languages. However, the number of related works for
Arabic is limited. Moreover, investigating the impact of short words length
and training set size is not well addressed. To the best of our knowledge, no
published works or researches, in this direction or even in other languages,
are available. Therefore, we propose to investigate this effect, taking into
account different stylomatric combination. The Mahalanobis distance (MD),
Linear Regression (LR), and Multilayer Perceptron (MP) are selected as AA
classifiers. During the experiment, the training dataset size is increased and
the accuracy of the classifiers is recorded. The results are quite interesting
and show different classifiers behaviors. Combining word-based stylomatric
features with n-grams provides the best accuracy reached in average 93%.
Keywords:
Arabic language
Authorship attribution training
set size
Linear regression
Mahalonobis distance
MLP classifier
Copyright © 2019 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Mohammed Al-Sarem,
Department of Information System, Taibah University,
PO box 344, Medina, Kingdom of Saudi Arabia.
Email: mohsarem@gmail.com
1. INTRODUCTION
Unlike text categorization, authorship attribution (AA) is a subfield of linguistics analysis which
aims to identify the original author among a set of candidate authors [1]. The idea can be created basically as
follows: for given text sets of known authorship, the author of the unseen text is estimated by matching the
anonymous text to one author of the candidate set. The estimated function depends on the extracted human
“stylometric fingerprint” features. Despite author's writing style that can change from topic to topic, some
persistent, uncontrolled habit and writing styles are still valid over the time.
Although the importance of Arabic language and the global position which it holds and the widely
use by hundreds of millions of people, only a very small number of authorship attribution studies that employ
machine-learning methods has been published for Arabic texts so [2]. Abbasi and Chen, e.g., applied support
vector machine (SVM) and C4.5 decision trees on political and social Arabic web forum messages [3].
Despite the notable results they have, the dataset is quite large to extract enough features. Stamatatos [4]
tested also the use of SVM on Arabic newspaper. Although the experiment was conducted to investigate the
effect of class imbalance problem, the findings encouraged us to investigate which size and length effect the
performance of AA classifiers has. In [2], the linear discriminant analysis (LDA) is used. For attributing text,
function words were used as feature. At training and testing phase, the used texts were divided into two
chunks: the first chunk with 1000-words, while the second chunk with 2000-words. The findings indicate that
the longer the text is, the higher the obtained performance. The best performance they obtained with the
Int J Elec & Comp Eng ISSN: 2088-8708 
The effect of training set size in authorship attribution: application on short ... (Mohammed AL-Sarem)
653
second chunk was around 87% accuracy. The same findings were reported in [5] in which the authors
examined the performance of several classifiers, namely, SVM, MLP, Linear Regression, Stamatatos
distance and Manhattan distance. The results confirmed the findings by Eder in 2013 for the English
language [6] and the minimum size of textual data is 2500 words per documents.
Existing works on authorship attribution are conducted with long text size. Besides that, there are
many problems facing researchers in this domain, especially in Arabic:
a. Researches in AA are quite few and are not tackled as much as in other languages;
b. Absence a benchmark data sets, full adaptable support tools lead to increase the exerted effort in
extracting the necessary features;
c. Nature of Arabic language also adds extra steps to text preprocessing, and words in Arabic tend to be
short, which might reduce the effectiveness of some stylistic features, such as word length
distribution [3].
Like any supervised learning problem, solving the AA problems go through three key phases: (i)
data preprocessing phase where the typical dataset collection, feature extraction, feature selection, and
dimension reduction are necessary steps; (ii) training/testing phase: at this step, the classifiers are selected
and the model is built; and (iii) pattern evaluation: to ensure the performance of the selected classifier, k-fold
cross validation with different feature combination is conducted and accuracy of the models are computed.
Current paper tries to draw the researchers' attention to increase their efforts to enhance research in
Arabic language domain. Since the AA problems in Arabic domain are many, current study focuses, on one
hand, on investigating the effect of training set size with short text length. The reason behind such selection is
to provide insights on the way the training set size impacts the accuracy performance. On the other hand, to
help guide others in selecting suitable classifiers based on the size of the training set so that it will give the
optimum result.
In the preparation for this research, there are, to the best of our knowledge, no published works or
researches in this direction or even in other languages. Since there are numerous methods which have been
developed to tackle the AA problem, the researcher limits his investigation to selected methods and writing
style features; however, the idea is still valid to assess other classifiers with different writing style
combination.
This paper is organized as follows: Section II gives a general overview of the authorship attribution
problem, the selected writing styles, and the selected classifiers. Section III presents the conducted
experiment and finding results and Section IV provides some comparison with other existing works in term
of accuracy. Finally, Section V summarizes the whole paper.
2. AUTHORSHIP ATTRIBUTION
2.1 The selected writing style features
To identify the author of an unseen text, set of popular features are extracted in order to quantify the
writing style. In the literature, several features can be found. The interested reader is referred to [1], [5], [7]
and [8]. These features can be categorized into seven main groups: lexical, character, syntactic, semantic,
content-specific, structural and language-specific. In this paper, the lexical features, namely word-based
features have been selected. The features were combined with n-gram method and Part-of-Speech and
organized different sets for testing as follows:
Set1: Word-based Lexical Features (WLF).
Set2: N-Gram.
Set3: Part-of-Speech (POS)
Set4: N-gram + POS
Set5: WLF+N-gram
Set6: WLF+POS
Set7: WLF+N-gram+ POS
2.2 The selected classifiers
Since there are various machine learning methods that can be used for AA and due to many
classifiers can be selected, our selection was randomly and the following methods were used: Mahalanobis
distance (MD), Linear Regression (LR), and Multilayer Perceptron (MP).
2.2.1. Mahalanobis distance
The Mahalanobis distance is a measure of distance between an observation 𝑥⃗ = (𝑥 , 𝑥 , 𝑥 , … , 𝑥 ) ,
a set of its means 𝜇⃗ = (𝜇 , 𝜇 , 𝜇 , … , 𝜇 ) and a distribution D ( covariance matrix S ) as follows:
 ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 9, No. 1, February 2019 : 652 - 659
654
)()()( 1
  
xSxxD T
M (1)
It can also be defined as a dissimilarity measure between two random vectors x and
y of the
same distribution with the covariance matrix S:
)()(),( 1
yxSyxyxd T
 
(2)
In the cases the where the covariance matrix is the identity matrix, the Mahalanobis distance is
reduced to the Euclidean distance. In addition to this, if the covariance matrix is diagonal, then the resulting
distance measure is called a normalized Euclidean distance:



N
i i
ii
s
yx
yxd
1
2
2
)(
),( (3)
where is is the standard deviation of the ix and iy over the sample set. The Mahanalobis distance is
widely used in LDA, Clustering analysis, and classification techniques.
2.2.2. Linear regression
LR is one of the oldest methods that is widely used in predictive analysis. The main idea behind it is
to minimize the sum of the squared errors to fit a straight line to a set of data points. The LR model assumes
that the relationship between the dependent variable iy and the p-vector of regressors ix is linear.
Formally, the model takes the form:
  Xy (4)
where,
y is called the regressand or response variable, X are regressors or predictor variables,  are a
"estimated effects" or regression coefficients, and is an error term.
2.2.3. Multilayer perceptron
The MLP is a classical neural network classifier in which the errors of the output are used to train
the network [9]. When applying MLP classifier, we should take in consideration the high computational
effort for training [10] and problem falling in a local minima which leads to some errors of classification. In
nature, the MLP is a class of feed forward network. It consists of three layers of nodes with at least one or
more hidden layer(s). To ensure training of the networks, several back- propagation techniques have been
utilized. Figure 1 represents a MLP with a single hidden layer (interested reader can refer [11].
Figure 1. A MLP with single hidden layer
Int J Elec & Comp Eng ISSN: 2088-8708 
The effect of training set size in authorship attribution: application on short ... (Mohammed AL-Sarem)
655
3. EXPERIMENT AND FINDING RESULTS
3.1. Dataset
Our dataset corpus consists of 4631 texts documents which were extracted from Dar Al-ifta AL
Misriyyah website. The texts include Islamic fatwas from 1896 to 1996. The original texts were unstructured
and required special tool to extract the fatwa structure from the whole documents. For this purpose, a C# tool
was implemented. Since we seek to investigate the size and word length effects on AA classifiers, we
grouped fatwas by the word length and size into different groups. Then, a set with the smallest test size is
used later during the experiment. Table 3 summarizes the statistical features of the used dataset. The word
size ranges widely. It varies from 11 words to 1500 words per texts.
3.2. Design of experiment
As mentioned previously, AA problem can be considered a supervised learning task which means
there is a need to train the model, and then test it on a set of unseen texts. Despite the whole text mining
process, we focus here only on the training/test phases and evaluate the performance of the classifiers by the
number of right predicted texts. As said earlier, a combination of WLF, N-gram, and POS features are used
and three types of classifiers are employed (i.e., Mahalanobis distance, MLP, and Linear Regression).
Figure 2 presents in addition to the typical solution of an AA problem, the way that we are followed
during the experiment (the colored lines in red). Below, the general procedure of the experiment:
a. After executing the whole training and attributing phase, the finding results are recorded. Then, at the
next iteration, we increase the training set size and the whole procedure is executed again and again. We
limit the experiment with 4-fold iterations. However, there is a possibility to increase number of iterations
based on the partitioned training sets.
b. The classifier type is still the same until we investigate all the partitioned training sets.
c. The all writing feature combinations are examined, and then the accuracy of the model is recorded.
To avoid a bias that may occur, the partitioned training sets were collected carefully and t-tests were
conducted to ensure that the training sets are not significant differ. The training sets were organized as
follows:
a. TS1: the training set with 330 documents
b. TS2: the training set with 378 documents
c. TS3: the training set with 414 documents
d. TS4: the training set with 438 documents
Figure 2. Design of experiment
 ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 9, No. 1, February 2019 : 652 - 659
656
3.3. Experimental findings
3.3.1. Authorship attribution using WLS features
We recorded the accuracy performance after employing the classifiers with different training sets as
shown in Table 1. As we can see, the classifiers performed differently within the training sets. In almost data
sets, the MLP overcomes others classifiers when the training sets have been increased.
Table 1. Performance of classifiers with WLF features
Classifier Type TS4 TS3 TS2 TS1
MLP 77.08% 91.67% 79.167% 75%
LR 72.916% 91.67% 70.83% 83.33%
MD 64.58% 66.67% 37.5% 58.33%
Interesting finding is notated that accuracy of the MLP classifier increased with increasing the
training set size, then decreased vastly. The same thing can be said about LR and MD methods with some
nuance change. The LR performed better than MD. In addition, it gives also better results where the training
set is quite few.
3.3.2. Authorship attribution using N-gram features
To investigate impact of the training sets size on AA using word N-gram. Indeed, the value of the
gram can be selected randomly. However, to capture more words, we adjusted the n-gram to be five. Table 2
shows the performance of classifiers regarding each training set. Table 3 summarizes the statistical features
of the used dataset. The word size ranges widely. It varies from 11 words to 1500 words per texts.
Table 2. Performance of classifiers with N-gram
Classifier Type TS4 TS3 TS2 TS1
MLP 89.58% 91.67% 91.67% 100%
LR 89.58% 91.67% 87.5% 100%
MD 95.83% 94.44% 95.83% 100%
Table 3. Statistical features of dataset
Statistical
Features
Textual Data Feature
# words # sentences Avg. words/sent. # paragraphs Avg. words length
Mean 125.26 4.51 28.04 12.4975 4.514755
St. Dev. 147.73 3.2850 22.115 12.5138 0.2897
Variance 21824.9 10.79 489.079 156.596 0.08395
Max. 1844 31 195.5 160 5.585
Min. 11 2 4.75 160 3.761
Comparing the classifiers' performance with n-gram with their performance with WLF, we note an
improvement in AA accuracy. However, with increase the training set size the performance deteriorated.
Furthermore, the MD method overcomes other classifiers even with increasing the training set size as shown
in Figure 3.
Figure 3. Classifier Performance with N-gram feature
Int J Elec & Comp Eng ISSN: 2088-8708 
The effect of training set size in authorship attribution: application on short ... (Mohammed AL-Sarem)
657
3.3.3. Authorship attribution using POS features
With POS method, the performance of the classifiers are different. Both the accuracy of MLP and
LR methods increases with increasing the training set size until a certain point, then decreases again with
increasing the size. The MLP classifier decreases significantly comparing with LR. On the opposite, the MD
method shows a different behavior. The accuracy of MD classifier is still better than other classifiers as
shown in Figure 4.
Figure 4. Classifier performance with POS
3.3.4. Authorship attribution with all features
In this part, we investigate performance of the classifiers with different combination. Table 4
summarizes all the finding results with different combinations. In most cases, the LR overcomes the other
classifiers. The MD provides low performance accuracy, especially when the WLF features combined with
POS. In general, combining POS with N-grams enhances the performance of attribution task. However, the
accuracy is still lower than the WLF and n-gram combination. It is also notable that the classifier's
performance affected negatively with increasing the training set size. In contrary, applying the full
combination of WLF, N-grams, and POS lead to decreasing the accuracy of AA.
Table 4. Performance of classifiers with different combination
Training Sets (TS)
Methods
TS1TS2TS3TS4
69.44%83.33%91.67%83.33%MLPWLF+N-gram
62.5%83.33%91.67%66.67%LR
41.67%60.42%61.11%50%MD
58.33%66.67%83.33%79.167%MLPWLF+ POS
48.61%70.83%86.11%79.167%LR
34.22%41.67%47.22%37.5%MD
66.67%83.33%69.44%87.5%MLPAll
62.5%87.5%94.44%62.5%LR
30.56%56.25%61.11%41.67%MD
79.17%81.25%91.67%91.67%MLPn-gram + POS
86.11%79.17%91.67%87.5%LR
62.5%62.5%77.78%75%MD
4. COMPARISON WITH OTHER WORKS
As conclusion of the previous sections, we found that combining word-based stylomatric features
with n-grams provides the best accuracy reached in average 93%. Thus, current section compares
performances of those classifiers that are used in the reviewed works with our findings in term of accuracy.
Even if they obtained come from different datasets, the comparison gives an indication of the performance of
the different methods. Table 5 presents the comparison regardless the text size. It presents accuracy of the
used classifiers of those works mentioned throughout this paper regardless the text size, whilst Table 6
present the accuracy taking in consideration the same text size.
 ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 9, No. 1, February 2019 : 652 - 659
658
Table 5. Comparison of the best reached accuracy in this work with accuracy of other methods regardless the
text size
Reference Accuracy Data
Our Work 100% Islamic Fatwas collected from Dar Al-ifta AL Misriyyah website
Shaker and Corne [2] 87.63% Arabic books obtained from the website of the Arab Writers Union
Abbasi and Chen [3] 85.4% Arabic web forum messages from Yahoo groups
Stamatatos [4] 93.4% Arabic newspaper report of Alhayat
Ouamour et al. [5] 100% Ancient historical books in Arabic
Altheneyan and El Bachir Menai [12] 97,84% Arabic books collected from Alwaraq website
Table 6. Comparison of the best reached accuracy in this workwith accuracy of other methods regarding the
text size
Reference Accuracy Data Size
Our Work 93% [11-1500] words per document
Ouamour et al. [5] 80% [100-1500] words per document
Al-Ayyoub et al. [8] 59.8% Up to 140 words per tweets
5. CONCLUSIONS AND FUTURE WORK
This study has investigated the effect of increasing training set size on performance of authorship
attribution classifiers. The experiment was designed to measure accuracy of classifiers with short Arabic
texts. It was also designed to investigate the performance of AA classifier using different combination of
stylomatric features. We limited our experiment to assess three well-known classifiers, namely linear
regression, multilayer perceptron, and Mahanolobis distance. However, the research methodology is still
valid with other classifiers.
The overall results show that the classifiers have different behaviors regarding the used AA features
and training set size. The MLP classifier exceeds the other classifiers when the WLF features are used. The
MLP, to a certain point, is positively affected by the increase in the training set size, then decreased. The n-
gram methods lead to decreasing the classifiers' performance with increasing the training set size. However,
generally speaking, the n-gram features provide the best results among the all stylomatric features. With POS
features, the classifiers show different behaviors. The MLP classifier decreases significantly compared with
LR. On the contrary, the MD shows a different behaviors and provides better accuracy than MLP and LR.
We also applied the classifiers with different features combination (WLF+ n-Gram, WLF+ POS, n-
Gram+ POS, and a set of all features). The MD provides low accuracy especially when the WLF features
combined with POS. Combining POS with N-grams improves the performance of attribution task. Contrary
to the expected results, applying the full combination of WLF, N-grams, and POS leads to decreasing the
accuracy of AA.
For future work, we intend to extend the experiments to investigate accuracy of other classifiers and
to a larger training set. We also plan to investigate the impact of other feature selection methods on the
performance of the AA problem.
REFERENCES
[1] A. Al-Falahi, M. Ramdani, M. Bellafkih, M. Al-Sarem, "Authorship Attribution in Arabic Poetry' Context Using
Markov Chain classifier," IEEE, 2015.
[2] Shaker, K., Corne, D., “Authorship Attributionin Arabic Using A Hybrid Of Evolutionary Search And Linear
Discriminant Analysis,” In: 2010UK Workshopon Computational Intelligence (UKCI), pp.1–6. 2010,
Doi:10.1109/UKCI.2010.5625580.
[3] Abbasi, A., Chen, H., “Applying Authorship Analysis To Arabic Webcontent,” In: Kantor, P., Muresan, G.,
Roberts, F.,Zeng, D.D., Wang, F.-Y., Chen, H., Merkle, R.C. (Eds.), Intelligence and Security Informatics, vol.
3495. Springer-Verlag, Berlin, Heidelberg, pp. 183–197, 2005.
[4] Stamatatos, E., “Author Identification: Using Text Sampling To Handle The Class Imbalance Problem,” Inf.
Process. Manage. 44 (2), 790–799, 2008, http://dx.doi.org/10.1016/j.ipm.2007.05.012.
[5] Ouamour, S., Khennouf, S., Bourib, S., Hadjadj, H., & Sayoud, H. “Effect of the Text Size on Stylometry-
Application on Arabic Religious Texts”. In Advanced Computational Methods for Knowledge Engineering pp. 215-
228, Springer, Cham., 2016.
[6] Eder, M., “Does Size Matter? Authorship Attribution, Small Samples, Big Problem,” Lit. Ling. Comput. 2013.
doi:10.1093/llc/fqt066.
[7] Sara El Manar El Bouanani, Ismail Kassou, "Authorship Analysis Studies: A survey," International Journal of
Computer Applications (0975-8887) Volume 86-No 12, January 2014.
[8] M. Al-Ayyoub, Y. Jararweh, A., Rababa'ah, and M., Aldwairi. “Feature Extraction and Selection for Arabic Tweets
Authorship Authentication,” J. Ambient Intell Human Comput. 8:383-393, 2017, DOI 10.1007/s12652-017-0452-1.
Int J Elec & Comp Eng ISSN: 2088-8708 
The effect of training set size in authorship attribution: application on short ... (Mohammed AL-Sarem)
659
[9] Sayoud, H., “Automatic Speaker Recognition-Connexionnist Approach,” PhD thesis, USTHB University, Algiers,
2003.
[10] F.J Tweedie, S. Singh, and D.I. Holmes, "Neural Network Application in Stylometry: The Federalist Papers,"
Computers and the Humanities, vol. 30, pp. 1-10, 1996.
[11] Pal, S. K., & Mitra, S. “Multilayer Perceptron, Fuzzy Sets, and Classification,” IEEE Transactions on neural
networks, 3(5), 683-697, 1992.
[12] Altheneyan AS, “Menai MEB Naïve Bayes Classifiers for Authorship Attribution of Arabic Texts,” J King Saud
Univ Comput Inf Sci, 26 (4):473–484, 2014.
BIOGRAPHIES OF AUTHORS
Mohammed Al-Sarem is an assistant professor of information system at the Taibah University,
Al Madinah Al Monawarah, KSA. He received the PhD in Informatics from Hassan II
University, Mohammadia, Morocco in 2014. His research interests center on E-learning,
educational data mining, Arabic text mining, and intelligent and adaptive systems. He published
several research papers and participated in several local/international conferences
Abdel Hamid Emara received his BSc, MSc, and PhD in Systems and computers engineering
from Al-Azhar university in 1992, 2000, 2006, respectively. He works as a lecturer in Systems
and Computers Engineering Department at Al-Azhar University in Cairo, Egypt. He has
experience of 12 years which includes both academic and research. He is currently an Assistant
Professor in Computer Science Department at University of Taibah at Al- Madinah Al
Monawarah, KSA. He is research interest includes Knowledge Discovery and Data Mining, Data
Security, Data analysis and Artificial intelligence. He has published quality International journal
papers.

More Related Content

What's hot

04. 9990 16097-1-ed (edited arf)
04. 9990 16097-1-ed (edited arf)04. 9990 16097-1-ed (edited arf)
04. 9990 16097-1-ed (edited arf)IAESIJEECS
 
Arabic tweeps dialect prediction based on machine learning approach
Arabic tweeps dialect prediction based on machine learning approach Arabic tweeps dialect prediction based on machine learning approach
Arabic tweeps dialect prediction based on machine learning approach IJECEIAES
 
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...ijnlc
 
A novel method for arabic multi word term extraction
A novel method for arabic multi word term extractionA novel method for arabic multi word term extraction
A novel method for arabic multi word term extractionijdms
 
Cross language information retrieval in indian
Cross language information retrieval in indianCross language information retrieval in indian
Cross language information retrieval in indianeSAT Publishing House
 
TEXT PLAGIARISM CHECKER USING FRIENDSHIP GRAPHS
TEXT PLAGIARISM CHECKER USING FRIENDSHIP GRAPHSTEXT PLAGIARISM CHECKER USING FRIENDSHIP GRAPHS
TEXT PLAGIARISM CHECKER USING FRIENDSHIP GRAPHSijcsit
 
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...acijjournal
 
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIRA NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIRcscpconf
 
A New Concept Extraction Method for Ontology Construction From Arabic Text
A New Concept Extraction Method for Ontology Construction From Arabic TextA New Concept Extraction Method for Ontology Construction From Arabic Text
A New Concept Extraction Method for Ontology Construction From Arabic TextCSCJournals
 
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB...
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB...STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB...
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB...ijnlc
 
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word EmbeddingIRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word EmbeddingIRJET Journal
 
Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...ijaia
 
Two Level Disambiguation Model for Query Translation
Two Level Disambiguation Model for Query TranslationTwo Level Disambiguation Model for Query Translation
Two Level Disambiguation Model for Query TranslationIJECEIAES
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageEditor IJCATR
 
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURESNAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURESacijjournal
 
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGESA NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGESijnlc
 

What's hot (19)

04. 9990 16097-1-ed (edited arf)
04. 9990 16097-1-ed (edited arf)04. 9990 16097-1-ed (edited arf)
04. 9990 16097-1-ed (edited arf)
 
Arabic tweeps dialect prediction based on machine learning approach
Arabic tweeps dialect prediction based on machine learning approach Arabic tweeps dialect prediction based on machine learning approach
Arabic tweeps dialect prediction based on machine learning approach
 
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
 
Ny3424442448
Ny3424442448Ny3424442448
Ny3424442448
 
A novel method for arabic multi word term extraction
A novel method for arabic multi word term extractionA novel method for arabic multi word term extraction
A novel method for arabic multi word term extraction
 
Cross language information retrieval in indian
Cross language information retrieval in indianCross language information retrieval in indian
Cross language information retrieval in indian
 
TEXT PLAGIARISM CHECKER USING FRIENDSHIP GRAPHS
TEXT PLAGIARISM CHECKER USING FRIENDSHIP GRAPHSTEXT PLAGIARISM CHECKER USING FRIENDSHIP GRAPHS
TEXT PLAGIARISM CHECKER USING FRIENDSHIP GRAPHS
 
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...
 
P33077080
P33077080P33077080
P33077080
 
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIRA NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
 
A New Concept Extraction Method for Ontology Construction From Arabic Text
A New Concept Extraction Method for Ontology Construction From Arabic TextA New Concept Extraction Method for Ontology Construction From Arabic Text
A New Concept Extraction Method for Ontology Construction From Arabic Text
 
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB...
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB...STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB...
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB...
 
L1803058388
L1803058388L1803058388
L1803058388
 
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word EmbeddingIRJET- Short-Text Semantic Similarity using Glove Word Embedding
IRJET- Short-Text Semantic Similarity using Glove Word Embedding
 
Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...
 
Two Level Disambiguation Model for Query Translation
Two Level Disambiguation Model for Query TranslationTwo Level Disambiguation Model for Query Translation
Two Level Disambiguation Model for Query Translation
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi Language
 
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURESNAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES
 
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGESA NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES
 

Similar to The effect of training set size in authorship attribution: application on short arabic texts

A hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerA hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerIAESIJAI
 
Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...IJECEIAES
 
AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ...
AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ...AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ...
AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ...ijaia
 
Gender and Authorship Categorisation of Arabic Text from Twitter Using PPM
Gender and Authorship Categorisation of Arabic Text from Twitter Using PPMGender and Authorship Categorisation of Arabic Text from Twitter Using PPM
Gender and Authorship Categorisation of Arabic Text from Twitter Using PPMAIRCC Publishing Corporation
 
Gender and Authorship Categorisation of Arabic Text from Twitter Using PPM
Gender and Authorship Categorisation of Arabic Text from Twitter Using PPMGender and Authorship Categorisation of Arabic Text from Twitter Using PPM
Gender and Authorship Categorisation of Arabic Text from Twitter Using PPMAIRCC Publishing Corporation
 
An in-depth review on News Classification through NLP
An in-depth review on News Classification through NLPAn in-depth review on News Classification through NLP
An in-depth review on News Classification through NLPIRJET Journal
 
EasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdfEasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdfNohaGhoweil
 
Query Answering Approach Based on Document Summarization
Query Answering Approach Based on Document SummarizationQuery Answering Approach Based on Document Summarization
Query Answering Approach Based on Document SummarizationIJMER
 
Writer Identification via CNN Features and SVM
Writer Identification via CNN Features and SVMWriter Identification via CNN Features and SVM
Writer Identification via CNN Features and SVMIRJET Journal
 
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERWITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERkevig
 
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERWITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERijnlc
 
ATAR: Attention-based LSTM for Arabizi transliteration
ATAR: Attention-based LSTM for Arabizi transliterationATAR: Attention-based LSTM for Arabizi transliteration
ATAR: Attention-based LSTM for Arabizi transliterationIJECEIAES
 
An improved Arabic text classification method using word embedding
An improved Arabic text classification method using word embeddingAn improved Arabic text classification method using word embedding
An improved Arabic text classification method using word embeddingIJECEIAES
 
Project t Proposal Bangla alphabet handwritten recognition using deep learnin...
Project t Proposal Bangla alphabet handwritten recognition using deep learnin...Project t Proposal Bangla alphabet handwritten recognition using deep learnin...
Project t Proposal Bangla alphabet handwritten recognition using deep learnin...KhondokerAbuNaim
 
Taxonomy extraction from automotive natural language requirements using unsup...
Taxonomy extraction from automotive natural language requirements using unsup...Taxonomy extraction from automotive natural language requirements using unsup...
Taxonomy extraction from automotive natural language requirements using unsup...ijnlc
 
A Survey of Various Methods for Text Summarization
A Survey of Various Methods for Text SummarizationA Survey of Various Methods for Text Summarization
A Survey of Various Methods for Text SummarizationIJERD Editor
 

Similar to The effect of training set size in authorship attribution: application on short arabic texts (20)

A hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerA hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzer
 
Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...
 
AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ...
AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ...AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ...
AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ...
 
Text classification
 Text classification Text classification
Text classification
 
Text classification
 Text classification Text classification
Text classification
 
Text classification
 Text classification Text classification
Text classification
 
Gender and Authorship Categorisation of Arabic Text from Twitter Using PPM
Gender and Authorship Categorisation of Arabic Text from Twitter Using PPMGender and Authorship Categorisation of Arabic Text from Twitter Using PPM
Gender and Authorship Categorisation of Arabic Text from Twitter Using PPM
 
Gender and Authorship Categorisation of Arabic Text from Twitter Using PPM
Gender and Authorship Categorisation of Arabic Text from Twitter Using PPMGender and Authorship Categorisation of Arabic Text from Twitter Using PPM
Gender and Authorship Categorisation of Arabic Text from Twitter Using PPM
 
An in-depth review on News Classification through NLP
An in-depth review on News Classification through NLPAn in-depth review on News Classification through NLP
An in-depth review on News Classification through NLP
 
EasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdfEasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdf
 
Query Answering Approach Based on Document Summarization
Query Answering Approach Based on Document SummarizationQuery Answering Approach Based on Document Summarization
Query Answering Approach Based on Document Summarization
 
E43022023
E43022023E43022023
E43022023
 
Writer Identification via CNN Features and SVM
Writer Identification via CNN Features and SVMWriter Identification via CNN Features and SVM
Writer Identification via CNN Features and SVM
 
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERWITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
 
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERWITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
 
ATAR: Attention-based LSTM for Arabizi transliteration
ATAR: Attention-based LSTM for Arabizi transliterationATAR: Attention-based LSTM for Arabizi transliteration
ATAR: Attention-based LSTM for Arabizi transliteration
 
An improved Arabic text classification method using word embedding
An improved Arabic text classification method using word embeddingAn improved Arabic text classification method using word embedding
An improved Arabic text classification method using word embedding
 
Project t Proposal Bangla alphabet handwritten recognition using deep learnin...
Project t Proposal Bangla alphabet handwritten recognition using deep learnin...Project t Proposal Bangla alphabet handwritten recognition using deep learnin...
Project t Proposal Bangla alphabet handwritten recognition using deep learnin...
 
Taxonomy extraction from automotive natural language requirements using unsup...
Taxonomy extraction from automotive natural language requirements using unsup...Taxonomy extraction from automotive natural language requirements using unsup...
Taxonomy extraction from automotive natural language requirements using unsup...
 
A Survey of Various Methods for Text Summarization
A Survey of Various Methods for Text SummarizationA Survey of Various Methods for Text Summarization
A Survey of Various Methods for Text Summarization
 

More from IJECEIAES

Cloud service ranking with an integration of k-means algorithm and decision-m...
Cloud service ranking with an integration of k-means algorithm and decision-m...Cloud service ranking with an integration of k-means algorithm and decision-m...
Cloud service ranking with an integration of k-means algorithm and decision-m...IJECEIAES
 
Prediction of the risk of developing heart disease using logistic regression
Prediction of the risk of developing heart disease using logistic regressionPrediction of the risk of developing heart disease using logistic regression
Prediction of the risk of developing heart disease using logistic regressionIJECEIAES
 
Predictive analysis of terrorist activities in Thailand's Southern provinces:...
Predictive analysis of terrorist activities in Thailand's Southern provinces:...Predictive analysis of terrorist activities in Thailand's Southern provinces:...
Predictive analysis of terrorist activities in Thailand's Southern provinces:...IJECEIAES
 
Optimal model of vehicular ad-hoc network assisted by unmanned aerial vehicl...
Optimal model of vehicular ad-hoc network assisted by  unmanned aerial vehicl...Optimal model of vehicular ad-hoc network assisted by  unmanned aerial vehicl...
Optimal model of vehicular ad-hoc network assisted by unmanned aerial vehicl...IJECEIAES
 
Improving cyberbullying detection through multi-level machine learning
Improving cyberbullying detection through multi-level machine learningImproving cyberbullying detection through multi-level machine learning
Improving cyberbullying detection through multi-level machine learningIJECEIAES
 
Comparison of time series temperature prediction with autoregressive integrat...
Comparison of time series temperature prediction with autoregressive integrat...Comparison of time series temperature prediction with autoregressive integrat...
Comparison of time series temperature prediction with autoregressive integrat...IJECEIAES
 
Strengthening data integrity in academic document recording with blockchain a...
Strengthening data integrity in academic document recording with blockchain a...Strengthening data integrity in academic document recording with blockchain a...
Strengthening data integrity in academic document recording with blockchain a...IJECEIAES
 
Design of storage benchmark kit framework for supporting the file storage ret...
Design of storage benchmark kit framework for supporting the file storage ret...Design of storage benchmark kit framework for supporting the file storage ret...
Design of storage benchmark kit framework for supporting the file storage ret...IJECEIAES
 
Detection of diseases in rice leaf using convolutional neural network with tr...
Detection of diseases in rice leaf using convolutional neural network with tr...Detection of diseases in rice leaf using convolutional neural network with tr...
Detection of diseases in rice leaf using convolutional neural network with tr...IJECEIAES
 
A systematic review of in-memory database over multi-tenancy
A systematic review of in-memory database over multi-tenancyA systematic review of in-memory database over multi-tenancy
A systematic review of in-memory database over multi-tenancyIJECEIAES
 
Agriculture crop yield prediction using inertia based cat swarm optimization
Agriculture crop yield prediction using inertia based cat swarm optimizationAgriculture crop yield prediction using inertia based cat swarm optimization
Agriculture crop yield prediction using inertia based cat swarm optimizationIJECEIAES
 
Three layer hybrid learning to improve intrusion detection system performance
Three layer hybrid learning to improve intrusion detection system performanceThree layer hybrid learning to improve intrusion detection system performance
Three layer hybrid learning to improve intrusion detection system performanceIJECEIAES
 
Non-binary codes approach on the performance of short-packet full-duplex tran...
Non-binary codes approach on the performance of short-packet full-duplex tran...Non-binary codes approach on the performance of short-packet full-duplex tran...
Non-binary codes approach on the performance of short-packet full-duplex tran...IJECEIAES
 
Improved design and performance of the global rectenna system for wireless po...
Improved design and performance of the global rectenna system for wireless po...Improved design and performance of the global rectenna system for wireless po...
Improved design and performance of the global rectenna system for wireless po...IJECEIAES
 
Advanced hybrid algorithms for precise multipath channel estimation in next-g...
Advanced hybrid algorithms for precise multipath channel estimation in next-g...Advanced hybrid algorithms for precise multipath channel estimation in next-g...
Advanced hybrid algorithms for precise multipath channel estimation in next-g...IJECEIAES
 
Performance analysis of 2D optical code division multiple access through unde...
Performance analysis of 2D optical code division multiple access through unde...Performance analysis of 2D optical code division multiple access through unde...
Performance analysis of 2D optical code division multiple access through unde...IJECEIAES
 
On performance analysis of non-orthogonal multiple access downlink for cellul...
On performance analysis of non-orthogonal multiple access downlink for cellul...On performance analysis of non-orthogonal multiple access downlink for cellul...
On performance analysis of non-orthogonal multiple access downlink for cellul...IJECEIAES
 
Phase delay through slot-line beam switching microstrip patch array antenna d...
Phase delay through slot-line beam switching microstrip patch array antenna d...Phase delay through slot-line beam switching microstrip patch array antenna d...
Phase delay through slot-line beam switching microstrip patch array antenna d...IJECEIAES
 
A simple feed orthogonal excitation X-band dual circular polarized microstrip...
A simple feed orthogonal excitation X-band dual circular polarized microstrip...A simple feed orthogonal excitation X-band dual circular polarized microstrip...
A simple feed orthogonal excitation X-band dual circular polarized microstrip...IJECEIAES
 
A taxonomy on power optimization techniques for fifthgeneration heterogenous ...
A taxonomy on power optimization techniques for fifthgeneration heterogenous ...A taxonomy on power optimization techniques for fifthgeneration heterogenous ...
A taxonomy on power optimization techniques for fifthgeneration heterogenous ...IJECEIAES
 

More from IJECEIAES (20)

Cloud service ranking with an integration of k-means algorithm and decision-m...
Cloud service ranking with an integration of k-means algorithm and decision-m...Cloud service ranking with an integration of k-means algorithm and decision-m...
Cloud service ranking with an integration of k-means algorithm and decision-m...
 
Prediction of the risk of developing heart disease using logistic regression
Prediction of the risk of developing heart disease using logistic regressionPrediction of the risk of developing heart disease using logistic regression
Prediction of the risk of developing heart disease using logistic regression
 
Predictive analysis of terrorist activities in Thailand's Southern provinces:...
Predictive analysis of terrorist activities in Thailand's Southern provinces:...Predictive analysis of terrorist activities in Thailand's Southern provinces:...
Predictive analysis of terrorist activities in Thailand's Southern provinces:...
 
Optimal model of vehicular ad-hoc network assisted by unmanned aerial vehicl...
Optimal model of vehicular ad-hoc network assisted by  unmanned aerial vehicl...Optimal model of vehicular ad-hoc network assisted by  unmanned aerial vehicl...
Optimal model of vehicular ad-hoc network assisted by unmanned aerial vehicl...
 
Improving cyberbullying detection through multi-level machine learning
Improving cyberbullying detection through multi-level machine learningImproving cyberbullying detection through multi-level machine learning
Improving cyberbullying detection through multi-level machine learning
 
Comparison of time series temperature prediction with autoregressive integrat...
Comparison of time series temperature prediction with autoregressive integrat...Comparison of time series temperature prediction with autoregressive integrat...
Comparison of time series temperature prediction with autoregressive integrat...
 
Strengthening data integrity in academic document recording with blockchain a...
Strengthening data integrity in academic document recording with blockchain a...Strengthening data integrity in academic document recording with blockchain a...
Strengthening data integrity in academic document recording with blockchain a...
 
Design of storage benchmark kit framework for supporting the file storage ret...
Design of storage benchmark kit framework for supporting the file storage ret...Design of storage benchmark kit framework for supporting the file storage ret...
Design of storage benchmark kit framework for supporting the file storage ret...
 
Detection of diseases in rice leaf using convolutional neural network with tr...
Detection of diseases in rice leaf using convolutional neural network with tr...Detection of diseases in rice leaf using convolutional neural network with tr...
Detection of diseases in rice leaf using convolutional neural network with tr...
 
A systematic review of in-memory database over multi-tenancy
A systematic review of in-memory database over multi-tenancyA systematic review of in-memory database over multi-tenancy
A systematic review of in-memory database over multi-tenancy
 
Agriculture crop yield prediction using inertia based cat swarm optimization
Agriculture crop yield prediction using inertia based cat swarm optimizationAgriculture crop yield prediction using inertia based cat swarm optimization
Agriculture crop yield prediction using inertia based cat swarm optimization
 
Three layer hybrid learning to improve intrusion detection system performance
Three layer hybrid learning to improve intrusion detection system performanceThree layer hybrid learning to improve intrusion detection system performance
Three layer hybrid learning to improve intrusion detection system performance
 
Non-binary codes approach on the performance of short-packet full-duplex tran...
Non-binary codes approach on the performance of short-packet full-duplex tran...Non-binary codes approach on the performance of short-packet full-duplex tran...
Non-binary codes approach on the performance of short-packet full-duplex tran...
 
Improved design and performance of the global rectenna system for wireless po...
Improved design and performance of the global rectenna system for wireless po...Improved design and performance of the global rectenna system for wireless po...
Improved design and performance of the global rectenna system for wireless po...
 
Advanced hybrid algorithms for precise multipath channel estimation in next-g...
Advanced hybrid algorithms for precise multipath channel estimation in next-g...Advanced hybrid algorithms for precise multipath channel estimation in next-g...
Advanced hybrid algorithms for precise multipath channel estimation in next-g...
 
Performance analysis of 2D optical code division multiple access through unde...
Performance analysis of 2D optical code division multiple access through unde...Performance analysis of 2D optical code division multiple access through unde...
Performance analysis of 2D optical code division multiple access through unde...
 
On performance analysis of non-orthogonal multiple access downlink for cellul...
On performance analysis of non-orthogonal multiple access downlink for cellul...On performance analysis of non-orthogonal multiple access downlink for cellul...
On performance analysis of non-orthogonal multiple access downlink for cellul...
 
Phase delay through slot-line beam switching microstrip patch array antenna d...
Phase delay through slot-line beam switching microstrip patch array antenna d...Phase delay through slot-line beam switching microstrip patch array antenna d...
Phase delay through slot-line beam switching microstrip patch array antenna d...
 
A simple feed orthogonal excitation X-band dual circular polarized microstrip...
A simple feed orthogonal excitation X-band dual circular polarized microstrip...A simple feed orthogonal excitation X-band dual circular polarized microstrip...
A simple feed orthogonal excitation X-band dual circular polarized microstrip...
 
A taxonomy on power optimization techniques for fifthgeneration heterogenous ...
A taxonomy on power optimization techniques for fifthgeneration heterogenous ...A taxonomy on power optimization techniques for fifthgeneration heterogenous ...
A taxonomy on power optimization techniques for fifthgeneration heterogenous ...
 

Recently uploaded

Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .Satyam Kumar
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 

Recently uploaded (20)

Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 

The effect of training set size in authorship attribution: application on short arabic texts

  • 1. International Journal of Electrical and Computer Engineering (IJECE) Vol. 9, No. 1, February 2019, pp. 652~659 ISSN: 2088-8708, DOI: 10.11591/ijece.v9i1.pp652-659  652 Journal homepage: http://iaescore.com/journals/index.php/IJECE The effect of training set size in authorship attribution: application on short Arabic texts Mohammed AL-Sarem1 , Abdel-Hamid Emara2 1Department of Information System, Taibah University, Madinah, KSA 1Department of Computer Science, Saba'a Region University, Mareb, Yemen 2Department of Computer Science, Taibah University, Madinah, KSA 2Computers and Systems Engineering Department, Al-Azhar University, Egypt Article Info ABSTRACT Article history: Received Apr 14, 2018 Revised Sep 3, 2018 Accepted Sep 26, 2018 Authorship attribution (AA) is a subfield of linguistics analysis, aiming to identify the original author among a set of candidate authors. Several research papers were published and several methods and models were developed for many languages. However, the number of related works for Arabic is limited. Moreover, investigating the impact of short words length and training set size is not well addressed. To the best of our knowledge, no published works or researches, in this direction or even in other languages, are available. Therefore, we propose to investigate this effect, taking into account different stylomatric combination. The Mahalanobis distance (MD), Linear Regression (LR), and Multilayer Perceptron (MP) are selected as AA classifiers. During the experiment, the training dataset size is increased and the accuracy of the classifiers is recorded. The results are quite interesting and show different classifiers behaviors. Combining word-based stylomatric features with n-grams provides the best accuracy reached in average 93%. Keywords: Arabic language Authorship attribution training set size Linear regression Mahalonobis distance MLP classifier Copyright © 2019 Institute of Advanced Engineering and Science. All rights reserved. Corresponding Author: Mohammed Al-Sarem, Department of Information System, Taibah University, PO box 344, Medina, Kingdom of Saudi Arabia. Email: mohsarem@gmail.com 1. INTRODUCTION Unlike text categorization, authorship attribution (AA) is a subfield of linguistics analysis which aims to identify the original author among a set of candidate authors [1]. The idea can be created basically as follows: for given text sets of known authorship, the author of the unseen text is estimated by matching the anonymous text to one author of the candidate set. The estimated function depends on the extracted human “stylometric fingerprint” features. Despite author's writing style that can change from topic to topic, some persistent, uncontrolled habit and writing styles are still valid over the time. Although the importance of Arabic language and the global position which it holds and the widely use by hundreds of millions of people, only a very small number of authorship attribution studies that employ machine-learning methods has been published for Arabic texts so [2]. Abbasi and Chen, e.g., applied support vector machine (SVM) and C4.5 decision trees on political and social Arabic web forum messages [3]. Despite the notable results they have, the dataset is quite large to extract enough features. Stamatatos [4] tested also the use of SVM on Arabic newspaper. Although the experiment was conducted to investigate the effect of class imbalance problem, the findings encouraged us to investigate which size and length effect the performance of AA classifiers has. In [2], the linear discriminant analysis (LDA) is used. For attributing text, function words were used as feature. At training and testing phase, the used texts were divided into two chunks: the first chunk with 1000-words, while the second chunk with 2000-words. The findings indicate that the longer the text is, the higher the obtained performance. The best performance they obtained with the
  • 2. Int J Elec & Comp Eng ISSN: 2088-8708  The effect of training set size in authorship attribution: application on short ... (Mohammed AL-Sarem) 653 second chunk was around 87% accuracy. The same findings were reported in [5] in which the authors examined the performance of several classifiers, namely, SVM, MLP, Linear Regression, Stamatatos distance and Manhattan distance. The results confirmed the findings by Eder in 2013 for the English language [6] and the minimum size of textual data is 2500 words per documents. Existing works on authorship attribution are conducted with long text size. Besides that, there are many problems facing researchers in this domain, especially in Arabic: a. Researches in AA are quite few and are not tackled as much as in other languages; b. Absence a benchmark data sets, full adaptable support tools lead to increase the exerted effort in extracting the necessary features; c. Nature of Arabic language also adds extra steps to text preprocessing, and words in Arabic tend to be short, which might reduce the effectiveness of some stylistic features, such as word length distribution [3]. Like any supervised learning problem, solving the AA problems go through three key phases: (i) data preprocessing phase where the typical dataset collection, feature extraction, feature selection, and dimension reduction are necessary steps; (ii) training/testing phase: at this step, the classifiers are selected and the model is built; and (iii) pattern evaluation: to ensure the performance of the selected classifier, k-fold cross validation with different feature combination is conducted and accuracy of the models are computed. Current paper tries to draw the researchers' attention to increase their efforts to enhance research in Arabic language domain. Since the AA problems in Arabic domain are many, current study focuses, on one hand, on investigating the effect of training set size with short text length. The reason behind such selection is to provide insights on the way the training set size impacts the accuracy performance. On the other hand, to help guide others in selecting suitable classifiers based on the size of the training set so that it will give the optimum result. In the preparation for this research, there are, to the best of our knowledge, no published works or researches in this direction or even in other languages. Since there are numerous methods which have been developed to tackle the AA problem, the researcher limits his investigation to selected methods and writing style features; however, the idea is still valid to assess other classifiers with different writing style combination. This paper is organized as follows: Section II gives a general overview of the authorship attribution problem, the selected writing styles, and the selected classifiers. Section III presents the conducted experiment and finding results and Section IV provides some comparison with other existing works in term of accuracy. Finally, Section V summarizes the whole paper. 2. AUTHORSHIP ATTRIBUTION 2.1 The selected writing style features To identify the author of an unseen text, set of popular features are extracted in order to quantify the writing style. In the literature, several features can be found. The interested reader is referred to [1], [5], [7] and [8]. These features can be categorized into seven main groups: lexical, character, syntactic, semantic, content-specific, structural and language-specific. In this paper, the lexical features, namely word-based features have been selected. The features were combined with n-gram method and Part-of-Speech and organized different sets for testing as follows: Set1: Word-based Lexical Features (WLF). Set2: N-Gram. Set3: Part-of-Speech (POS) Set4: N-gram + POS Set5: WLF+N-gram Set6: WLF+POS Set7: WLF+N-gram+ POS 2.2 The selected classifiers Since there are various machine learning methods that can be used for AA and due to many classifiers can be selected, our selection was randomly and the following methods were used: Mahalanobis distance (MD), Linear Regression (LR), and Multilayer Perceptron (MP). 2.2.1. Mahalanobis distance The Mahalanobis distance is a measure of distance between an observation 𝑥⃗ = (𝑥 , 𝑥 , 𝑥 , … , 𝑥 ) , a set of its means 𝜇⃗ = (𝜇 , 𝜇 , 𝜇 , … , 𝜇 ) and a distribution D ( covariance matrix S ) as follows:
  • 3.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 9, No. 1, February 2019 : 652 - 659 654 )()()( 1    xSxxD T M (1) It can also be defined as a dissimilarity measure between two random vectors x and y of the same distribution with the covariance matrix S: )()(),( 1 yxSyxyxd T   (2) In the cases the where the covariance matrix is the identity matrix, the Mahalanobis distance is reduced to the Euclidean distance. In addition to this, if the covariance matrix is diagonal, then the resulting distance measure is called a normalized Euclidean distance:    N i i ii s yx yxd 1 2 2 )( ),( (3) where is is the standard deviation of the ix and iy over the sample set. The Mahanalobis distance is widely used in LDA, Clustering analysis, and classification techniques. 2.2.2. Linear regression LR is one of the oldest methods that is widely used in predictive analysis. The main idea behind it is to minimize the sum of the squared errors to fit a straight line to a set of data points. The LR model assumes that the relationship between the dependent variable iy and the p-vector of regressors ix is linear. Formally, the model takes the form:   Xy (4) where, y is called the regressand or response variable, X are regressors or predictor variables,  are a "estimated effects" or regression coefficients, and is an error term. 2.2.3. Multilayer perceptron The MLP is a classical neural network classifier in which the errors of the output are used to train the network [9]. When applying MLP classifier, we should take in consideration the high computational effort for training [10] and problem falling in a local minima which leads to some errors of classification. In nature, the MLP is a class of feed forward network. It consists of three layers of nodes with at least one or more hidden layer(s). To ensure training of the networks, several back- propagation techniques have been utilized. Figure 1 represents a MLP with a single hidden layer (interested reader can refer [11]. Figure 1. A MLP with single hidden layer
  • 4. Int J Elec & Comp Eng ISSN: 2088-8708  The effect of training set size in authorship attribution: application on short ... (Mohammed AL-Sarem) 655 3. EXPERIMENT AND FINDING RESULTS 3.1. Dataset Our dataset corpus consists of 4631 texts documents which were extracted from Dar Al-ifta AL Misriyyah website. The texts include Islamic fatwas from 1896 to 1996. The original texts were unstructured and required special tool to extract the fatwa structure from the whole documents. For this purpose, a C# tool was implemented. Since we seek to investigate the size and word length effects on AA classifiers, we grouped fatwas by the word length and size into different groups. Then, a set with the smallest test size is used later during the experiment. Table 3 summarizes the statistical features of the used dataset. The word size ranges widely. It varies from 11 words to 1500 words per texts. 3.2. Design of experiment As mentioned previously, AA problem can be considered a supervised learning task which means there is a need to train the model, and then test it on a set of unseen texts. Despite the whole text mining process, we focus here only on the training/test phases and evaluate the performance of the classifiers by the number of right predicted texts. As said earlier, a combination of WLF, N-gram, and POS features are used and three types of classifiers are employed (i.e., Mahalanobis distance, MLP, and Linear Regression). Figure 2 presents in addition to the typical solution of an AA problem, the way that we are followed during the experiment (the colored lines in red). Below, the general procedure of the experiment: a. After executing the whole training and attributing phase, the finding results are recorded. Then, at the next iteration, we increase the training set size and the whole procedure is executed again and again. We limit the experiment with 4-fold iterations. However, there is a possibility to increase number of iterations based on the partitioned training sets. b. The classifier type is still the same until we investigate all the partitioned training sets. c. The all writing feature combinations are examined, and then the accuracy of the model is recorded. To avoid a bias that may occur, the partitioned training sets were collected carefully and t-tests were conducted to ensure that the training sets are not significant differ. The training sets were organized as follows: a. TS1: the training set with 330 documents b. TS2: the training set with 378 documents c. TS3: the training set with 414 documents d. TS4: the training set with 438 documents Figure 2. Design of experiment
  • 5.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 9, No. 1, February 2019 : 652 - 659 656 3.3. Experimental findings 3.3.1. Authorship attribution using WLS features We recorded the accuracy performance after employing the classifiers with different training sets as shown in Table 1. As we can see, the classifiers performed differently within the training sets. In almost data sets, the MLP overcomes others classifiers when the training sets have been increased. Table 1. Performance of classifiers with WLF features Classifier Type TS4 TS3 TS2 TS1 MLP 77.08% 91.67% 79.167% 75% LR 72.916% 91.67% 70.83% 83.33% MD 64.58% 66.67% 37.5% 58.33% Interesting finding is notated that accuracy of the MLP classifier increased with increasing the training set size, then decreased vastly. The same thing can be said about LR and MD methods with some nuance change. The LR performed better than MD. In addition, it gives also better results where the training set is quite few. 3.3.2. Authorship attribution using N-gram features To investigate impact of the training sets size on AA using word N-gram. Indeed, the value of the gram can be selected randomly. However, to capture more words, we adjusted the n-gram to be five. Table 2 shows the performance of classifiers regarding each training set. Table 3 summarizes the statistical features of the used dataset. The word size ranges widely. It varies from 11 words to 1500 words per texts. Table 2. Performance of classifiers with N-gram Classifier Type TS4 TS3 TS2 TS1 MLP 89.58% 91.67% 91.67% 100% LR 89.58% 91.67% 87.5% 100% MD 95.83% 94.44% 95.83% 100% Table 3. Statistical features of dataset Statistical Features Textual Data Feature # words # sentences Avg. words/sent. # paragraphs Avg. words length Mean 125.26 4.51 28.04 12.4975 4.514755 St. Dev. 147.73 3.2850 22.115 12.5138 0.2897 Variance 21824.9 10.79 489.079 156.596 0.08395 Max. 1844 31 195.5 160 5.585 Min. 11 2 4.75 160 3.761 Comparing the classifiers' performance with n-gram with their performance with WLF, we note an improvement in AA accuracy. However, with increase the training set size the performance deteriorated. Furthermore, the MD method overcomes other classifiers even with increasing the training set size as shown in Figure 3. Figure 3. Classifier Performance with N-gram feature
  • 6. Int J Elec & Comp Eng ISSN: 2088-8708  The effect of training set size in authorship attribution: application on short ... (Mohammed AL-Sarem) 657 3.3.3. Authorship attribution using POS features With POS method, the performance of the classifiers are different. Both the accuracy of MLP and LR methods increases with increasing the training set size until a certain point, then decreases again with increasing the size. The MLP classifier decreases significantly comparing with LR. On the opposite, the MD method shows a different behavior. The accuracy of MD classifier is still better than other classifiers as shown in Figure 4. Figure 4. Classifier performance with POS 3.3.4. Authorship attribution with all features In this part, we investigate performance of the classifiers with different combination. Table 4 summarizes all the finding results with different combinations. In most cases, the LR overcomes the other classifiers. The MD provides low performance accuracy, especially when the WLF features combined with POS. In general, combining POS with N-grams enhances the performance of attribution task. However, the accuracy is still lower than the WLF and n-gram combination. It is also notable that the classifier's performance affected negatively with increasing the training set size. In contrary, applying the full combination of WLF, N-grams, and POS lead to decreasing the accuracy of AA. Table 4. Performance of classifiers with different combination Training Sets (TS) Methods TS1TS2TS3TS4 69.44%83.33%91.67%83.33%MLPWLF+N-gram 62.5%83.33%91.67%66.67%LR 41.67%60.42%61.11%50%MD 58.33%66.67%83.33%79.167%MLPWLF+ POS 48.61%70.83%86.11%79.167%LR 34.22%41.67%47.22%37.5%MD 66.67%83.33%69.44%87.5%MLPAll 62.5%87.5%94.44%62.5%LR 30.56%56.25%61.11%41.67%MD 79.17%81.25%91.67%91.67%MLPn-gram + POS 86.11%79.17%91.67%87.5%LR 62.5%62.5%77.78%75%MD 4. COMPARISON WITH OTHER WORKS As conclusion of the previous sections, we found that combining word-based stylomatric features with n-grams provides the best accuracy reached in average 93%. Thus, current section compares performances of those classifiers that are used in the reviewed works with our findings in term of accuracy. Even if they obtained come from different datasets, the comparison gives an indication of the performance of the different methods. Table 5 presents the comparison regardless the text size. It presents accuracy of the used classifiers of those works mentioned throughout this paper regardless the text size, whilst Table 6 present the accuracy taking in consideration the same text size.
  • 7.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 9, No. 1, February 2019 : 652 - 659 658 Table 5. Comparison of the best reached accuracy in this work with accuracy of other methods regardless the text size Reference Accuracy Data Our Work 100% Islamic Fatwas collected from Dar Al-ifta AL Misriyyah website Shaker and Corne [2] 87.63% Arabic books obtained from the website of the Arab Writers Union Abbasi and Chen [3] 85.4% Arabic web forum messages from Yahoo groups Stamatatos [4] 93.4% Arabic newspaper report of Alhayat Ouamour et al. [5] 100% Ancient historical books in Arabic Altheneyan and El Bachir Menai [12] 97,84% Arabic books collected from Alwaraq website Table 6. Comparison of the best reached accuracy in this workwith accuracy of other methods regarding the text size Reference Accuracy Data Size Our Work 93% [11-1500] words per document Ouamour et al. [5] 80% [100-1500] words per document Al-Ayyoub et al. [8] 59.8% Up to 140 words per tweets 5. CONCLUSIONS AND FUTURE WORK This study has investigated the effect of increasing training set size on performance of authorship attribution classifiers. The experiment was designed to measure accuracy of classifiers with short Arabic texts. It was also designed to investigate the performance of AA classifier using different combination of stylomatric features. We limited our experiment to assess three well-known classifiers, namely linear regression, multilayer perceptron, and Mahanolobis distance. However, the research methodology is still valid with other classifiers. The overall results show that the classifiers have different behaviors regarding the used AA features and training set size. The MLP classifier exceeds the other classifiers when the WLF features are used. The MLP, to a certain point, is positively affected by the increase in the training set size, then decreased. The n- gram methods lead to decreasing the classifiers' performance with increasing the training set size. However, generally speaking, the n-gram features provide the best results among the all stylomatric features. With POS features, the classifiers show different behaviors. The MLP classifier decreases significantly compared with LR. On the contrary, the MD shows a different behaviors and provides better accuracy than MLP and LR. We also applied the classifiers with different features combination (WLF+ n-Gram, WLF+ POS, n- Gram+ POS, and a set of all features). The MD provides low accuracy especially when the WLF features combined with POS. Combining POS with N-grams improves the performance of attribution task. Contrary to the expected results, applying the full combination of WLF, N-grams, and POS leads to decreasing the accuracy of AA. For future work, we intend to extend the experiments to investigate accuracy of other classifiers and to a larger training set. We also plan to investigate the impact of other feature selection methods on the performance of the AA problem. REFERENCES [1] A. Al-Falahi, M. Ramdani, M. Bellafkih, M. Al-Sarem, "Authorship Attribution in Arabic Poetry' Context Using Markov Chain classifier," IEEE, 2015. [2] Shaker, K., Corne, D., “Authorship Attributionin Arabic Using A Hybrid Of Evolutionary Search And Linear Discriminant Analysis,” In: 2010UK Workshopon Computational Intelligence (UKCI), pp.1–6. 2010, Doi:10.1109/UKCI.2010.5625580. [3] Abbasi, A., Chen, H., “Applying Authorship Analysis To Arabic Webcontent,” In: Kantor, P., Muresan, G., Roberts, F.,Zeng, D.D., Wang, F.-Y., Chen, H., Merkle, R.C. (Eds.), Intelligence and Security Informatics, vol. 3495. Springer-Verlag, Berlin, Heidelberg, pp. 183–197, 2005. [4] Stamatatos, E., “Author Identification: Using Text Sampling To Handle The Class Imbalance Problem,” Inf. Process. Manage. 44 (2), 790–799, 2008, http://dx.doi.org/10.1016/j.ipm.2007.05.012. [5] Ouamour, S., Khennouf, S., Bourib, S., Hadjadj, H., & Sayoud, H. “Effect of the Text Size on Stylometry- Application on Arabic Religious Texts”. In Advanced Computational Methods for Knowledge Engineering pp. 215- 228, Springer, Cham., 2016. [6] Eder, M., “Does Size Matter? Authorship Attribution, Small Samples, Big Problem,” Lit. Ling. Comput. 2013. doi:10.1093/llc/fqt066. [7] Sara El Manar El Bouanani, Ismail Kassou, "Authorship Analysis Studies: A survey," International Journal of Computer Applications (0975-8887) Volume 86-No 12, January 2014. [8] M. Al-Ayyoub, Y. Jararweh, A., Rababa'ah, and M., Aldwairi. “Feature Extraction and Selection for Arabic Tweets Authorship Authentication,” J. Ambient Intell Human Comput. 8:383-393, 2017, DOI 10.1007/s12652-017-0452-1.
  • 8. Int J Elec & Comp Eng ISSN: 2088-8708  The effect of training set size in authorship attribution: application on short ... (Mohammed AL-Sarem) 659 [9] Sayoud, H., “Automatic Speaker Recognition-Connexionnist Approach,” PhD thesis, USTHB University, Algiers, 2003. [10] F.J Tweedie, S. Singh, and D.I. Holmes, "Neural Network Application in Stylometry: The Federalist Papers," Computers and the Humanities, vol. 30, pp. 1-10, 1996. [11] Pal, S. K., & Mitra, S. “Multilayer Perceptron, Fuzzy Sets, and Classification,” IEEE Transactions on neural networks, 3(5), 683-697, 1992. [12] Altheneyan AS, “Menai MEB Naïve Bayes Classifiers for Authorship Attribution of Arabic Texts,” J King Saud Univ Comput Inf Sci, 26 (4):473–484, 2014. BIOGRAPHIES OF AUTHORS Mohammed Al-Sarem is an assistant professor of information system at the Taibah University, Al Madinah Al Monawarah, KSA. He received the PhD in Informatics from Hassan II University, Mohammadia, Morocco in 2014. His research interests center on E-learning, educational data mining, Arabic text mining, and intelligent and adaptive systems. He published several research papers and participated in several local/international conferences Abdel Hamid Emara received his BSc, MSc, and PhD in Systems and computers engineering from Al-Azhar university in 1992, 2000, 2006, respectively. He works as a lecturer in Systems and Computers Engineering Department at Al-Azhar University in Cairo, Egypt. He has experience of 12 years which includes both academic and research. He is currently an Assistant Professor in Computer Science Department at University of Taibah at Al- Madinah Al Monawarah, KSA. He is research interest includes Knowledge Discovery and Data Mining, Data Security, Data analysis and Artificial intelligence. He has published quality International journal papers.