4. Quantitative linguistics is
the comparative study of
the frequency and
distribution of words and
syntactic structures in
different texts.
The aim of qualitative analysis is a complete, detailed description of
the linguistic features and phenomena identified in the data.
In quantitative analysis, those features are classified and counted,
and more complex statistical models may even be constructed
in an attempt to explain what is observed.
Quantitative findings can
be generalized to a larger
population, and direct
comparisons can be made
between two corpora.
Thus, quantitative analysis allows
us to discover which phenomena
are likely to be genuine
reflections of the behavior of a
language.
6. To carry out a comparative quantitative linguistic analysis of the most
common Egyptian aphorisms in two corpora, using the R statistical language.
The comparison is made between:
• a corpus in the Arabic language
• a corpus in the English language
8. The linguistic data collected for the two corpora are sentences taken
from the online database of Egyptian aphorisms (total = 100):
• 50 sentences of Egyptian aphorisms in the Arabic language
• 50 sentences of Egyptian aphorisms in the English language
9. The two corpora are analyzed in two steps:
First (automatically):
• using the online Stanford Parser to assign a part-of-speech tag (which
refers to a syntactic function) to each word of the two corpora, and to
measure the number of tokens and the time taken to tag the words of each
aphorism.
Second (manually):
• adding the description and inflection of each part-of-speech tag, with
the aid of the list of parts of speech encoded in the annotation system of
the Penn Treebank Project.
• analyzing the animacy (animate / inanimate) and the gender (masculine /
feminine) of each annotated word in the two corpora of Egyptian aphorisms.
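The manual step of attaching a description to each part-of-speech tag can be pictured as a table lookup. The following is only a minimal sketch: the name penn_tags, the short list of tags, and the toy words are assumptions made for illustration, and only a handful of Penn Treebank tags are included.

```r
# Hypothetical lookup table: a few Penn Treebank tags and their descriptions.
penn_tags <- c(
  NN  = "Noun, singular or mass",
  NNS = "Noun, plural",
  DT  = "Determiner",
  VB  = "Verb, base form",
  VBZ = "Verb, 3rd person singular present",
  IN  = "Preposition or subordinating conjunction",
  JJ  = "Adjective",
  PRP = "Personal pronoun"
)

# Toy tagged words (illustrative only, not taken from the corpora).
tagged <- data.frame(
  Word = c("the", "road", "is", "long"),
  Tag  = c("DT", "NN", "VBZ", "JJ"),
  stringsAsFactors = FALSE
)

# Attach the description of each tag; a tag missing from the table gives NA.
tagged$Description <- unname(penn_tags[tagged$Tag])
print(tagged)
```

Animacy and Gender columns would be filled in by hand in the same table, since they are not produced by the tagger.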
10. All the analyzed data are converted and tabulated in an Excel sheet,
which is saved as a CSV file so that it can be read by the R language.
CSV files: a1, a2, e1, e2
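Reading such a CSV file into R might look like the sketch below. It writes a small stand-in file first so that it runs on its own; the file path, the Sentence column, and all values are invented, while the Tokens and Length column names follow the plotting commands used in this presentation.

```r
# Write a tiny stand-in CSV so the sketch is self-contained.
# The values are invented; only the Tokens and Length names come from the slides.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Sentence,Tokens,Length",
             "1,4,20",
             "2,9,45",
             "3,12,61"), csv_path)

# Read the sheet into a data frame, as one would for the exported Excel data.
e <- read.csv(csv_path, stringsAsFactors = FALSE)

# Check the column names and types before doing any statistics.
str(e)
```

With real data, csv_path would simply be replaced by the path of the exported file.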
15. The prepared data are run in R and statistical measurements are made
to find the linguistic features of both corpora (Arabic and English) and to
investigate the quantitative linguistic (lexical) characteristics of
Egyptian aphorisms in the Arabic and English languages.
33. To extract the quantitative linguistic characteristics of the Arabic
and English corpora of Egyptian aphorisms and to compare them, this section
is divided into two subsections:
1) Statistical measurements by R: contains the statistical measurements
made with R statistics.
2) Visualizing the output by R: contains the visualization of the R results
with R graphics.
37. To visualize word length in the English corpus and in the Arabic
corpus:
• barplot(xtabs(~ee$Length), xlab = "word length in English corpus", col = "grey")
• barplot(xtabs(~aa$Length), xlab = "word length in Arabic corpus", col = "grey")
38. To visualize the number of tokens in each sentence of the query in the
English corpus and in the Arabic corpus:
• barplot(xtabs(~e$Tokens), xlab = "English corpus tokens numbers", col = "grey")
• barplot(xtabs(~a$Tokens), xlab = "Arabic corpus tokens numbers", col = "grey")
39. To visualize the query length in the English corpus and in the Arabic
corpus, the following R code is used:
• barplot(xtabs(~e$Length), xlab = "English query length", col = "grey")
• barplot(xtabs(~a$Length), xlab = "Arabic query length", col = "grey")
40. To visualize the animacy of each word of the query in the English
corpus and in the Arabic corpus, the following R code is used:
• barplot(xtabs(~ee$Animacy), xlab = "Animacy in English corpus", col = "grey")
• barplot(xtabs(~aa$Animacy), xlab = "Animacy in Arabic corpus", col = "grey")
41. To visualize the gender of each word of the query in the English corpus
and in the Arabic corpus, the following R code is used:
• barplot(xtabs(~ee$Gender), xlab = "Gender in English corpus", col = "grey")
• barplot(xtabs(~aa$Gender), xlab = "Gender in Arabic corpus", col = "grey")
42. To visualize the token counts in the English corpus and in the Arabic
corpus as histograms, the following R code is used (truehist comes from the
MASS package):
• library(MASS)
• truehist(e$Tokens, col = "lightblue", xlab = "English tokens numbers")
• truehist(a$Tokens, col = "lightblue", xlab = "Arabic tokens numbers")
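For a direct visual comparison, the two bar plots of token counts can be drawn next to each other. This is a minimal sketch with invented toy values standing in for the e and a data frames; the side-by-side layout is an assumption for illustration and is not stated in the slides.

```r
# Toy sentence-level data (invented); the real 'e' and 'a' hold 50 aphorisms each.
e <- data.frame(Tokens = c(4, 9, 9, 12, 21))
a <- data.frame(Tokens = c(4, 10, 10, 13, 22))

# Count how many sentences have each number of tokens.
e_counts <- xtabs(~ Tokens, data = e)
a_counts <- xtabs(~ Tokens, data = a)

# Draw the two bar plots in one row, then restore the previous layout.
old_par <- par(mfrow = c(1, 2))
barplot(e_counts, xlab = "English corpus tokens numbers", col = "grey")
barplot(a_counts, xlab = "Arabic corpus tokens numbers", col = "grey")
par(old_par)
```

Using the data argument of xtabs gives the same counts as xtabs(~e$Tokens) and keeps the axis label free of the "e$" prefix.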
44. The mean length of an Egyptian aphorism in the Arabic corpus (48.18
letters) is lower than that of its counterpart in the English corpus (50.66
letters), which means that, on average, an Egyptian aphorism is longer in
English than in Arabic.
By contrast, more tokens are used to express an Egyptian aphorism in Arabic
than to express its English counterpart: the mean number of tokens is 10.24
for an Arabic aphorism and 9.9 for an English one.
45. The minimum and maximum numbers of tokens are almost the same in both
corpora: the minimum is 4 tokens in both, and the maximum is 22 for the
Arabic corpus and 21 for the English corpus.
The mean word length in the Arabic corpus (3.794922) is lower than in the
English corpus (4.179798). Word length ranges from 1 to 8 in the Arabic
corpus and from 1 to 14 in the English corpus. The median word length,
however, is the same in both corpora (4 letters).
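Figures of this kind (mean, median, minimum, maximum) can be obtained with base R functions. The sketch below uses invented toy values rather than the real corpora, and the helper name describe is hypothetical.

```r
# Toy word-level data (invented); the real 'ee' and 'aa' hold every word of each corpus.
ee <- data.frame(Length = c(3, 2, 4, 14, 1, 5))
aa <- data.frame(Length = c(2, 3, 4, 8, 1, 5))

# Hypothetical helper: mean, median, minimum and maximum of a numeric column.
describe <- function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x))
}

# One row per corpus, so the figures can be compared directly.
stats <- rbind(English = describe(ee$Length), Arabic = describe(aa$Length))
print(stats)
```

The same helper could be applied to the Tokens column of the sentence-level data to compare token counts.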
46. As for the most frequent words in the two corpora of Egyptian
aphorisms: in the Arabic corpus the most frequent words are (من, ال, و, ان;
respectively from right to left), whereas in the English corpus the most
frequent words are (the, is, you, of; respectively from left to right).
Regarding the tag set of the Egyptian aphorisms, NN (noun) is the most
frequent tag in both corpora, followed by DT (determiner), and then by
verbs and prepositions.
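Frequency rankings like these can be produced by counting and sorting. A minimal sketch, assuming a word-level data frame with Word and Tag columns (both column names and all values here are invented for illustration):

```r
# Toy word-level data (invented, not taken from the corpora).
ee <- data.frame(
  Word = c("the", "road", "is", "long", "the", "end", "is", "near", "the"),
  Tag  = c("DT", "NN", "VBZ", "JJ", "DT", "NN", "VBZ", "JJ", "DT"),
  stringsAsFactors = FALSE
)

# Count each distinct value and sort from most to least frequent.
word_freq <- sort(table(tolower(ee$Word)), decreasing = TRUE)
tag_freq  <- sort(table(ee$Tag), decreasing = TRUE)

head(word_freq, 4)   # the four most frequent words
head(tag_freq, 4)    # the four most frequent tags
```

Lower-casing the words before counting is a design choice made here so that "The" and "the" are counted together.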