SlideShare a Scribd company logo
1 of 191
Download to read offline
Advanced Search/Indexing in Holy-Quran
Assem Chelli
2011 / 2012
Ministry of Higher Education and Scientific Research
National Higher School of Computer Science

Thesis Of Magister
Option : Mobile Destributed Computing (IRM)
Proposal of an Advanced Retrieval System for Noble Qur’an

Written By :

Supervised By :

• Assem CHELLI

• Pr. Amar BALLA
• Mr. Taha ZERROUKI

2011/ 2012
... and say : O my Lord ! have compassion on them, as they brought me up (when I was)
little. – Al-isra’ 24

iii
Acknowledgment

First at all, I am thanking Allah, the Almighty for giving me strength and patience to
write this modest thesis.
We gratefully acknowledge Pr. Amar Balla and Mr. Taha Zerrouki for giving me the
honor of their supervising during that year and guiding me with advices, and meaningful
criticism.
I also thank the jury members for agreeing to evaluate our modest work.
A big thanks to the faculty and the administration of the National Higher School of
Computer Science (ESI) who took care of my training and monitoring throughout the
study program.

Our deepest thanks go to the Arab open source community that has offered me a great
support, especially Alfanous Team/Community that makes a valuable contribution to
carry out this great work, and I hope to be worthy of the confidence they have placed on
me.
Finally I express my appreciation to all who contributed by their advice and their
encouragement to the completion of this work, my family and my friends for their
assistance and support.

iv
Abstract
Noble Quran is different of all documents that we have known. It’s the sacred book
of Muslims. It contains knowledge of all aspects of life. With this huge quantity of
information, we can extract only a small part manually and this is considered insufficient compared to the size of knowledge contained by Quran. That raises the need for
a method to extract those information because currently there is no efficient method
except many printed lexicons and many tools of simple sequential search with regular
expression. Due to this limitation, the Quran requires us to find new ways to interact.
The goal through this work is to propose a system for advanced research in all of
the information contained in the Quran by considering the morphology of the Arabic
language and the properties of the Qur’anic text. It should be based on modern methods of information retrieval for good stability and high speed search. It would be very
useful for researchers and could be generalized to cover all the content in Arabic.

Keywords : Indexing/Search, Arabic, Holy Quran, Information retrieval, Search
engines.

v
Résumé
Le Coran est différent de tous les documents que nous connaissons . C’est le livre
sacré des musulmans. Il comporte des connaissances sur tous les aspects de la vie. Avec
un tel volume d’informations, on ne peut y extraire qu’une infime partie manuellement.
Ceci s’avère être insuffisant vue la quantité de connaissances que contient le Coran. D’où
la nécessité de trouver une méthode pour extraire ces informations. Or il n’existe aucun
outil à utiliser sauf quelques lexiques imprimés et quelques outils de recherche simple
et séquentielle par les expressions régulières. En raison de cette limitation, le Coran
nous oblige à trouver de nouvelles façons d’interaction.
Le but recherché à travers ce travail est de proposer un système avancé de recherche
dans l’ensemble des informations contenues dans le Coran en prenant en considération
la morphologie de la langue Arabe et les propriétés du texte coranique. Elle doit être
fondée sur les méthodes modernes de recherche d’informations pour obtenir une bonne
stabilité et une recherche de grande vitesse. Elle serait trés utile pour les chercheurs et
pourrait être généralisée pour couvrir l’ensemble du contenu en arabe.

Mots clés : Indexation/Recherche, Arabe, Coran, Recherche d’information, Moteurs de recherche.

vi
‫ملخّص‬
‫القرآن الكريم يختلف عن جميع الوثائق التي نعرفها فهو يحوي المعارف في جميع جوانب الحياة. مع‬
‫هذا الحجم من المعلومات، لا يستطيع المرء استخراج إلا النزر اليسير يدويا وهذا ليس كافيا بالنسبة‬
‫لحجم المعارف الواردة في القرآن الكريم. ومن هنا جاءت الحاجة إلى إيجاد طريقة لاستخراج هذه‬
‫المعلومات. لا توجد أي وسيلة حالية فعالة باستثناء بعض المعاجم المطبوعة وبعض الأدوات التي‬
‫تعتمد البحث البسيط التسلسلي بالعبارات النمطية. وبسبب هذا القيد، يتوجب علينا إيجاد طرق‬
‫جديدة للتفاعل.‬
‫الهدف من خلال هذا العمل هو اقتراح نظام متقدم للبحث في جميع المعلومات الواردة في‬
‫القرآن الكريم، مع الأخذ بعين الاعتبار مورفولوجيا اللغة العربية وخصائص النص القرآني. وينبغي‬
‫الاستناد إلى الأساليب الحديثة في استرجاع المعلومات من أجل تحصيل استقرار جيد وسرعة عالية‬
‫في البحث. هذا العمل مفيد للباحثين ودارسي القرآن ويمكن أن يعمم ليشمل جميع المحتوى باللغة‬
‫َّ‬
‫العربية.‬
‫كلمات مفتاحية: بحث/فهرسة، العربية، القرآن، استخراج المعلومات، محركات البحث.‬

‫‪vii‬‬
Contents

Dedication

iii

Acknowledgment

iv

Table of Contents

xv

List of Figures

xvii

List of Tables

xix

List of Abbreviations

xx

Glossary

xxiv

General Introduction

1

I State Art

4

1 Search engines

5

1.1

Introduction

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5

1.2

Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

1.2.1

Keyword

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

1.2.2

Descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

1.2.3

Document

6

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

viii
1.2.4

Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

1.2.5

Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

1.3

Search engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7

1.4

Full-text search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7

1.5

Crawling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

8

1.5.1

Crawler Features . . . . . . . . . . . . . . . . . . . . . . . . . .

9

1.5.1.1

Features a crawler must provide . . . . . . . . . . . .

9

1.5.1.2

Features a crawler should provide . . . . . . . . . . .

9

Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

10

1.6.1

Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

10

1.6.2

Indexing modes . . . . . . . . . . . . . . . . . . . . . . . . . . .

11

1.6.2.1

Manual indexing . . . . . . . . . . . . . . . . . . . . .

11

1.6.2.2

Automatic indexing . . . . . . . . . . . . . . . . . . .

12

1.6.2.3

Semi-automatic indexing . . . . . . . . . . . . . . . .

13

Index types . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

13

1.6.3.1

Document Index

. . . . . . . . . . . . . . . . . . . .

13

1.6.3.2

Forward Index . . . . . . . . . . . . . . . . . . . . . .

14

1.6.3.3

Inverted index . . . . . . . . . . . . . . . . . . . . . .

14

1.6.3.4

N-gram index . . . . . . . . . . . . . . . . . . . . . . .

15

1.6.4

Index storage . . . . . . . . . . . . . . . . . . . . . . . . . . . .

15

1.6.5

Index update . . . . . . . . . . . . . . . . . . . . . . . . . . . .

16

1.6.5.1

Incremental update . . . . . . . . . . . . . . . . . . .

16

1.6.5.2

Global update . . . . . . . . . . . . . . . . . . . . . .

16

Indexing phases . . . . . . . . . . . . . . . . . . . . . . . . . . .

16

1.6.6.1

Tokenization . . . . . . . . . . . . . . . . . . . . . . .

16

1.6.6.2

Normalization . . . . . . . . . . . . . . . . . . . . . .

17

1.6.6.3

Elimination of stop-words . . . . . . . . . . . . . . . .

17

1.6.6.4

Weighting . . . . . . . . . . . . . . . . . . . . . . . . .

17

Querying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

18

1.7.1

Relevance concept . . . . . . . . . . . . . . . . . . . . . . . . .

19

1.7.2

Similarity Function . . . . . . . . . . . . . . . . . . . . . . . . .

19

1.7.3

Search process . . . . . . . . . . . . . . . . . . . . . . . . . . .

20

1.8

Semantic Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

22

1.9

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

1.6

1.6.3

1.6.6

1.7

2 Arabic Language

25

2.1

Introduction

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

2.2

Orthography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25
ix
2.3

Lexicography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

26

2.3.1

26

Verbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3.1.1

‫. :)الفعل‬
Verbs with augmented root (‫:)الفعل المزيد‬

. . . . . . .

26

. . . . . . .

27

Nouns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

28

2.3.1.2
2.3.2

Verbs with a simple root (‫المجرد‬
ّ

2.3.2.1

‫. . . . . . :)الأسماء‬
ّ
Nouns derived from verbals (‫: )الأسماء المشتقة‬
Primitive nouns (‫الجامدة‬

. . . . .

28

. . . .

28

2.3.2.3

Numbers : . . . . . . . . . . . . . . . . . . . . . . . .

28

2.3.2.4

Demonstrative pronouns (‫الإ شارة‬

. . . . . . . .

28

. . . . . . . . .

29

. . . . . . . .

29

Function words . . . . . . . . . . . . . . . . . . . . . . . . . . .

29

Morphology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

30

2.4.1

Flexional Morphology . . . . . . . . . . . . . . . . . . . . . . .

30

2.4.1.1

Flexion of verbs . . . . . . . . . . . . . . . . . . . . .

31

2.4.1.2

Flexion of nouns . . . . . . . . . . . . . . . . . . . . .

32

2.4.1.3

Flexion of function words . . . . . . . . . . . . . . . .

34

Derivational morphology . . . . . . . . . . . . . . . . . . . . . .

34

2.3.2.2

2.3.2.5
2.3.2.6
2.3.3
2.4

2.4.2

2.4.2.1
2.4.2.2

‫:) ٔاسماء‬

‫. . :) أسماء‬
Personal pronouns ( ‫:) الضمائر المنفصلة‬
Relative pronouns (‫موصولة‬

Deverbal noun (‫:)المصدر‬
Active participle (‫فاعل‬

. . . . . . . . . . . . . . . .

‫. . . . . . . . . . :)اسم‬
Passive participle (‫. . . . . . . . . :)اسم مفعول‬
Nouns of time and place (‫:)أسماء الزمان والمكان‬
Noun of instrument (‫. . . . . . . . . :)اسم الآلة‬
The Nomen Vicis (‫. . . . . . . . . . :)اسم المرة‬
The Nomen Speciei (‫. . . . . . . . . :)اسم الهيئة‬

35

. . . .

35

. . . .

35

. . . .

35

. . . .

35

. . . .

36

. . . .

36

Ambiguity issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

36

2.5.1

The absence of vocalization . . . . . . . . . . . . . . . . . . . .

36

2.5.2

Prefixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

37

2.5.3

Suffixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

38

2.6

The computerization of  Arabic language . . . . . . . . . . . . . . . . .

39

2.7

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

40

2.4.2.3
2.4.2.4
2.4.2.5
2.4.2.6
2.4.2.7
2.5

3 The Qur’an

41

3.1

Introduction

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

41

3.2

Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

41

3.3

Qur’an Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

42

3.3.1

Fragmentation into surahs . . . . . . . . . . . . . . . . . . . . .

43

3.3.2

Fragmentation into Hizbs . . . . . . . . . . . . . . . . . . . . .

43
x
3.3.3

44

Qur’anic Sciences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

44

3.4.1

Knowledge of Ayahs revelation places

. . . . . . . . . . . . . .

45

3.4.2

Knowledge of Ayahs revelation causes: . . . . . . . . . . . . . .

46

3.4.3

Knowledge of Morphology:

46

3.4.4

3.4

Fragmentation into Stops (Waqfs) . . . . . . . . . . . . . . . . .

‫. . . . . . : )علم مرسوم‬
Grammatical analysis of the Qur’an (‫: )إعراب ألفاظ القرآن‬
Science of allegorical ayahs (‫. . . . . . . . : )علم المتشابه‬
The beginnings of surahs (‫. . . . . . . . . . :)فواتح السور‬
ّ
Knowledge of Qur’anic Parables (‫. . . . . :)الأمثال القرآنية‬
Tafssīr (‫. . . . . . . . . . . . . . . . . . . . . . :)التفسير‬

. . . . . . . . . . . . . . . . . . . .

ّ
Knowledge of Orthography (‫الخط‬

. . . .

47

. . .

48

. . . .

48

. . . .

48

. . . .

49

. . . .

49

Computerization of the Qur’an . . . . . . . . . . . . . . . . . . . . . .

49

3.5.1

Advantages of Computerization . . . . . . . . . . . . . . . . . .

50

Qur’an Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

50

3.6.1

History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

51

3.6.1.1

Indexing words of the Qur’an . . . . . . . . . . . . . .

51

3.6.1.2

Ma’ājim of Qur’anic words . . . . . . . . . . . . . . .

51

3.6.1.3

Specialized Ma’ājim of Qur’anic words

. . . . . . . .

52

3.6.1.4

Qur’anic Indexes and Computer . . . . . . . . . . . .

52

Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . .

52

3.6.2.1

By unit: . . . . . . . . . . . . . . . . . . . . . . . . . .

52

3.6.2.2

By purpose . . . . . . . . . . . . . . . . . . . . . . . .

56

Projects of building indexes . . . . . . . . . . . . . . . . . . . .

58

3.6.3.1

Midād lbayān

58

3.6.3.2

Indexes by Taha Zerrouki

3.6.3.3

Qur’anic Arabic Corpus

3.4.5
3.4.6
3.4.7
3.4.8
3.4.9
3.5
3.6

3.6.2

3.6.3

. . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .

60

Tanzil Project . . . . . . . . . . . . . . . . . . . . . .

62

3.6.3.5

Boundary-Annotated Qur’an Corpus . . . . . . . . . .

64

3.6.3.6

Qurany Concepts Tool . . . . . . . . . . . . . . . . . .

65

Qur’an Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

66

3.7.1

Qur’anic Concepts Ontology

. . . . . . . . . . . . . . . . . . .

67

3.7.2
3.8

59

3.6.3.4

3.7

. . . . . . . . . . . . .

The Ontology made by Hadj Henni: . . . . . . . . . . . . . .

68

Qur’anic Search Tools . . . . . . . . . . . . . . . . . . . . . . . . . . .

69

3.8.1

69

Alawfa (‫. . . . . . . . . . . . . . . . . . . . . . . . . . . )الأوفى‬

3.8.2

Al-Monaqeb-Alqurany (‫القرآني‬

‫)المنقب‬

. . . . . . . . . . . . . .

70

3.8.3

Quran complex search service . . . . . . . . . . . . . . . . . . .

71

3.8.4

Quranic Researcher (‫القرآني‬

72

‫)الباحث‬

. . . . . . . . . . . . . . . .

xi
3.8.5

Quranologie (‫القرآن‬

. . . . . . . . . . . . . . . . . . . . . .

73

3.8.6

Quranic Corpus Word-by-Word Search . . . . . . . . . . . . . .

74

3.8.7

Tanzil (‫. . . . . . . . . . . . . . . . . . . . . . . . . . . . )تنزيل‬

75

Zekr (‫. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . )ذكر‬

76

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

77

3.8.8
3.9

II

‫)علم‬

Analysis & Conception

78

4 Classification & Proposition of Qur’anic Search Features

79

4.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

79

4.2

Difficulties of Search in Quran

. . . . . . . . . . . . . . . . . . . . . .

79

4.3

Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

81

4.4

Proposals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

82

4.4.1

Advanced Query . . . . . . . . . . . . . . . . . . . . . . . . . .

83

4.4.2

Output Improvements . . . . . . . . . . . . . . . . . . . . . . .

84

4.4.3

Suggestion Systems . . . . . . . . . . . . . . . . . . . . . . . . .

85

4.4.4

Linguistic Aspects . . . . . . . . . . . . . . . . . . . . . . . . .

87

4.4.5

Qur’anic Options . . . . . . . . . . . . . . . . . . . . . . . . . .

91

4.4.6

Semantic Queries . . . . . . . . . . . . . . . . . . . . . . . . . .

93

4.4.7

Statistical System . . . . . . . . . . . . . . . . . . . . . . . . .

96

Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

96

4.5.1

4.5

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

97

4.5.1.1

Survey Participants Details . . . . . . . . . . . . . . .

97

4.5.1.2
4.6

Survey

Results of survey . . . . . . . . . . . . . . . . . . . . .

99

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5 Conception

102

5.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.2

Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2.1

Text Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.2.2

Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.2.3

Suggestions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.2.4

Results Processing . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.2.5

Indexes Importing . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.3

Full vocalized search engine . . . . . . . . . . . . . . . . . . . . . . . . 107

5.4

Othmani script and text processing . . . . . . . . . . . . . . . . . . . . 111
5.4.1

Substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.1.1

Romanizations . . . . . . . . . . . . . . . . . . . . . . 113
xii
5.4.1.2

Numbers into words:

. . . . . . . . . . . . . . . . . . 114

5.4.2
5.4.3

Normalization: . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.4.4

Filtering stop-words: . . . . . . . . . . . . . . . . . . . . . . . . 120

5.4.5
5.5

Tokenization: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

Lemmatization: . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

Qur’anic Word Search . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5.1
5.5.2

Semantically Related Words . . . . . . . . . . . . . . . . . . . . 125

5.5.3

Multi-level Derivations . . . . . . . . . . . . . . . . . . . . . . . 126

5.5.4

Specific Derivations . . . . . . . . . . . . . . . . . . . . . . . . . 127

5.5.5
5.6

Word properties search . . . . . . . . . . . . . . . . . . . . . . . 124

Fuzzy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

III Implementation
6 Implementation

130
131

6.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

6.2

Why Open Source? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.1

License : AGPL

. . . . . . . . . . . . . . . . . . . . . . . . . . 133

6.2.2

Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

6.2.3

Whoosh Search API . . . . . . . . . . . . . . . . . . . . . . . . 134

6.3

Previous Code Base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

6.4

Our improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.4.1

A New Centralized JSON Output System: . . . . . . . . . . . . 138

6.4.2

Many new features . . . . . . . . . . . . . . . . . . . . . . . . . 140

6.4.3

Resource Importing Manager . . . . . . . . . . . . . . . . . . . 142

6.4.4

Automating the API building . . . . . . . . . . . . . . . . . . . 143

6.4.5

A new console interface . . . . . . . . . . . . . . . . . . . . . . 143

6.4.6

Enhancing the web interface . . . . . . . . . . . . . . . . . . . . 143

6.4.7

Packaging system: . . . . . . . . . . . . . . . . . . . . . . . . . 144

6.4.8

Multiple search units . . . . . . . . . . . . . . . . . . . . . . . . 144

6.4.9

Coding Standardization . . . . . . . . . . . . . . . . . . . . . . 145

6.4.10 Documentation covering . . . . . . . . . . . . . . . . . . . . . . 146
6.4.11 Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.5

Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.5.1

Application Programming Interface . . . . . . . . . . . . . . . . 150
6.5.1.1

JSON web service . . . . . . . . . . . . . . . . . . . . 152

xiii
6.5.1.2
6.6

Console interface

. . . . . . . . . . . . . . . . . . . . 153

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

General Conclusion

156

Bibliography

158

Appendices

A1

Annex A: Paper Abstracts

A2

xiv
List of Figures

1.1

The various components of a web search engine . . . . . . . . . . . . .

9

1.2

Indexing Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

11

1.3

Search process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

20

1.4

Search results returned as a web page: one of the possible ways to expose
results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

22

2.1

Ligature of lâm and álif. . . . . . . . . . . . . . . . . . . . . . . . . . .

39

3.1

Explanatory diagram of fragmentation into Surahs . . . . . . . . . . .

43

3.2

Explanatory diagram of fragmentation into Hizbs . . . . . . . . . . . .

44

3.3

Index using the word as a unit [Arabic Quranic Corpus]

. . . . . . . .

53

3.4

Index that takes the ayah as a unit [Arabeyes Quran Model] . . . . . .

54

3.5

Classification indexes by purpose . . . . . . . . . . . . . . . . . . . . .

56

3.6

The various structures of Qur’an . . . . . . . . . . . . . . . . . . . . .

57

3.7

Preview of Qurany . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

66

3.8

A closer look at Qur’an Concepts Ontology

68

3.9

Diagram of domain ontology of Qur’anic documents made by Hadj

. . . . . . . . . . . . . . .

Henni[Hadjhenni2008] . . . . . . . . . . . . . . . . . . . . . . . . . . .

69

3.10 Preview of Alawfa website: www.alawfa.com . . . . . . . . . . . . . . .

70

3.11 Preview of Al Monaqeb Alqurany . . . . . . . . . . . . . . . . . . . . .

71

3.12 Preview of Quran Complex Search page . . . . . . . . . . . . . . . . .

72

3.13 Preview of Quranic Researcher (www.quranicresearcher.com) . . . . . .

73

3.14 Preview of Quranologie (quranologie.com) . . . . . . . . . . . . . . . .

74

3.15 Preview of Qur’anic Arabic Corpus Word By Word search . . . . . . .

75

3.16 Preview of Tanzil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

76

3.17 Preview of Zekr Application . . . . . . . . . . . . . . . . . . . . . . . .

77

4.1

84

Pages view in Google.com . . . . . . . . . . . . . . . . . . . . . . . . .

xv
4.2

Highlight the keyword

‫( وحيدا‬alone) in Ayah 11 of al-modather – al-

fanous.org . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

84

4.3

Ayah in full diacritical marks - quran.com . . . . . . . . . . . . . . . .

85

4.4

Query spell correction in Google.com . . . . . . . . . . . . . . . . . . .

85

4.5

Focusing on ”Yaqub” in Ontology of concepts of corpus.quran.com

. .

86

4.6

Related searches suggestion in Google.com . . . . . . . . . . . . . . . .

87

4.7

Keyboard mapping Arabic to English in Google.com . . . . . . . . . .

87

4.8

Different romanizations for the word

. . . . . . . . . . . . . . . .

88

4.9

Different transliterations used in ElixirFM Resolve Online . . . . . . .

88

4.10 Syntactic Coloration of Basmalah – bayt-al-hikma.com . . . . . . . . .

88

4.11 Google Voice Search on Android . . . . . . . . . . . . . . . . . . . . . .

89

4.12 Annotations shown by Quranic Arabic Corpus website . . . . . . . . .

90

4.13 Divine names Highlight in Quran Reader iPhone application . . . . . .

91

4.14 Faceted Thematic Browsing - Qurany Project . . . . . . . . . . . . . .

93

4.15 Audience Background

. . . . . . . . . . . . . . . . . . . . . . . . . . .

98

4.16 Audience experience . . . . . . . . . . . . . . . . . . . . . . . . . . . .

99

‫خليفَة‬
َ

4.17 Clarity, Usefulness, and Need percentage of each feature . . . . . . . . 100
5.1

Basic Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.2

The behavior of Searcher . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.3

Text processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.4

Results processing phases . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.5

Different possible declensions of the word

5.6

Different types of the possible vocalizations of the word

5.7

Different writing forms of the word

5.8

Merging Words in Uthmani Script . . . . . . . . . . . . . . . . . . . . . 112

5.9

General Schema of Uthmani and Standard text processing. . . . . . . . 113

‫الملك‬

. . . . . . . . . . . . . 109

‫من‬

. . . . . . . 110

‫ بسطة‬using Othmani script

. . . . . 111

5.10 Example of Substitution . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.11 Tokenization of the word

‫فأَسق ْي َنٰكمو ُه‬
ُُ َْ َ

. . . . . . . . . . . . . . . . . . . . 116

5.12 Sub-tokens separation schema . . . . . . . . . . . . . . . . . . . . . . . 118
5.13 Example of tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.14 Arabic case markers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.15 Example of normalization . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.16 Example of stop-word filtering
5.17 Examples of lemmatization

. . . . . . . . . . . . . . . . . . . . . . 121

. . . . . . . . . . . . . . . . . . . . . . . . 122

5.18 Two-Steps search behavior . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.19 Semantically related words : Idols in Quran . . . . . . . . . . . . . . . 124
5.20 Word properties search example : First person, Plural, Masculine . . . 125
xvi
5.21 Searching through an ontology . . . . . . . . . . . . . . . . . . . . . . . 125
5.22 Semantically Related Words, Hyponymy of the word

‫( نبي‬prophet) .

. . 126

5.23 Multi-level Derivation Search example . . . . . . . . . . . . . . . . . . 127
5.24 Special derivations example, Imperative of

‫( قال‬to say)

. . . . . . . . . 128

6.1

Screenshot of the Qt desktop interface . . . . . . . . . . . . . . . . . . 136

6.2

Json results of

6.3

Fuzzy search example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

6.4

Showing adjacent ayahs . . . . . . . . . . . . . . . . . . . . . . . . . . 141

6.5

Showing ayahs in different scripts . . . . . . . . . . . . . . . . . . . . . 141

6.6

Suggestion example of Vocalizations , Derivations ,and Synonyms of

6.7

Annotations of the keyword

6.8

Buckwalter translation example . . . . . . . . . . . . . . . . . . . . . . 142

6.9

Fields table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

‫الكوثر‬

. . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

‫. قيمة‬

‫قول‬

141

. . . . . . . . . . . . . . . . . . . . . 141

6.10 Translation-as-unit search , Query: seven

. . . . . . . . . . . . . . . . 145

6.11 Word-as-unit json outout, Query:

‫قيمة‬

6.12 Interfaces dependency hierarchy

. . . . . . . . . . . . . . . . . . . . . 150

. . . . . . . . . . . . . . . . . . 145

6.13 API usage sample code . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.14 Preview of the JSON web service . . . . . . . . . . . . . . . . . . . . . 153
6.15 Preview of the Console interface . . . . . . . . . . . . . . . . . . . . . . 153

xvii
List of Tables

1.1

Document Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

14

1.2

Forward Index

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

14

1.3

Inverted Index

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

15

1.4

N-gram index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

15

2.1

3 types of Arabic letters: 1 form, 2 forms or 4 forms . . . . . . . . . . .

26

2.2

The change of meaning by changing the diacritical marks . . . . . . . .

36

2.3

The change of function by changing diacritical marks . . . . . . . . . .

37

2.4

Ambiguities of prefixes . . . . . . . . . . . . . . . . . . . . . . . . . . .

38

2.5

Ambiguities due to suffixes . . . . . . . . . . . . . . . . . . . . . . . . .

38

3.1

Waqfs types [Web-Islamweb] . . . . . . . . . . . . . . . . . . . . . . . .

44

3.2

Some numerical miracles of Qur’an[Nawfal1975] . . . . . . . . . . . . .

50

3.3

Index based on word parts . . . . . . . . . . . . . . . . . . . . . . . . .

55

3.4

Index based on sentences– surah: al-fātiha . . . . . . . . . . . . . . . .

56

3.5

Overview of main index– Midād lbayān . . . . . . . . . . . . . . . . . .

58

3.6

Overview of words index – M.Taha Zerrouki . . . . . . . . . . . . . . .

59

3.7

Overview of the topics index– Taha Zerrouki . . . . . . . . . . . . . . .

60

3.8

Overview of the index of synonyms – Taha Zerrouki . . . . . . . . . . .

60

3.9

Overview of morphology index – Quranic Arabic corpus . . . . . . . . .

62

3.10 Example of simple proper Quranic text – Tanzil.info . . . . . . . . . .

63

3.11 Example of Surah index – Tanzil.info . . . . . . . . . . . . . . . . . . .

63

3.12 Sajdah index – Tanzil.info . . . . . . . . . . . . . . . . . . . . . . . . .

64

3.13 Example of rub’ index – Tanzil.info . . . . . . . . . . . . . . . . . . . .

64

3.14 Sample of Boundary Annotated Qur’an Corpus . . . . . . . . . . . . .

65

5.1

Partial vocalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

5.2

Some numbers as they mentioned in Quran

. . . . . . . . . . . . . . . 115

xviii
6.1

Search request flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

6.2

Pylint Analysis stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

6.3

Implementation State of search features

. . . . . . . . . . . . . . . . . 148

xix
List of Abbreviations
AGPL

Affero General Public License.

API

Application Programming Interface.

GPL

GNU Public License.

GUI

Graphical User Interface.

IDF

Inverse Document Frequency.

OWL

Web Ontology Language.

PC

Personal Computer.

POS

Part Of Speech.

POS

Part Of Speech.

RSV

Retrieval Status Value.

TF

Term Frequency.

TF*TDF

term frequency - inverse document frequency.

UI

User Interface
.

xx
Glossary

Abrogated ayahs
Abrogating ayahs
Accusative
Active Participle
Active voice
Allegorical ayah
Assimilated verb
Attaching pronoun
Ayah
Basmalah
Book Science
Broken plural
Conjugation
Declension
Declinable
Defect
Demonstrative pronoun
Deverbal noun
Diacritical marks
Diphthong
Diptote
Dual form
Expansion

‫الآيات_المنسوخة‬
‫الآيات_الناسخة‬
‫حالة_النصب‬
‫اسم_الفاعل‬
‫صيغة_مبني_للمعلوم‬
‫الآيات_المتشابهات‬
ِ
‫فعل_م َثال‬
‫ضمير_م َّتصل‬
‫آية‬
‫بسملة‬
‫علم_الكتاب‬
‫جمع_تكسير‬
‫تصريف_الأفعال‬
‫الإ عراب‬
‫معرب‬
‫علّة‬
‫اسم_إشارة‬
‫مصدر‬
‫علامات_التشكيل‬
‫الإ دغام‬
‫ممنوع_من_الصرف‬
‫المثنّى‬
‫الزيادة‬
ّ
xxi
External feminine plural
External masculine plural
External plural
Fiqh
First ayahs of surah
First person
Fusional language
Geminated verb
General and Particular
Genitive
Hamzah
Hamzated verb
Healthy verb
Hizb
Hollow verb
Imperative
Imperfective
Instrument noun
Internal plural
Inversion
Jussive
Juz’
Lam-ALef
Last ayahs of surah
Laws
Lemma
Lexicology
Makkan
Medinan

‫جمع_مؤنث_السالم‬
‫جمع_مذكر_السالم‬
‫جمع_سالم‬
‫الفقه‬
‫فواتح_السورة‬
‫المتكلم‬
‫لغة_إشتقاقية‬
‫فعل_مضاعف‬
َ َ ُ
ّ‫الخاص_والعام‬
ّ
‫حالة_الجر‬
‫همزة‬
‫فعل_مهموز‬
َْ
‫فعل_صحيح‬
ِ
‫ح ْزب‬
‫فعل_أَج َوف‬
ْ
‫الأمر‬
‫المضارع‬
‫اسم_الآلة‬
‫جمع_تكسير‬
‫القلب‬
‫حالة_الجزم‬
‫ج ْزء‬
ُ
‫لام_ألف‬
‫خواتيم_السورة‬
‫الأحكام‬
‫جذع_الكلمة‬
‫علم_المعاجم‬
‫مكّ ّية‬
‫مدن ّية‬
xxii
Migration of Prophet
Morphology
Mus’haf
Narration of Hadiths
Nisf
Nomen Speciei
Nomen Vicis
Nominative
Noun of place
Noun of time
Object
Orthography
Othmani script
Passive Participle
Passive voice
People of the book
Perfective
Personal pronoun
Plural form
Primitive noun
Prophet’s Sunnah
Prosody
Prostration of recitation
Qiblah
Qur’anic comma
Qur’anic Parable
Recitation
Relative pronoun
Revelation Science
Rewayate
Rewayate of Kaloun

‫الهجرة_النبوية‬
‫الصرف‬
‫المصحف‬
ُ
‫رواية_الأحاديث‬
‫نِصف‬
ْ
‫اسم_الهيئة‬
‫اسم_المرة‬
َّ
‫حالة_الرفع‬
‫اسم_مكان‬
‫اسم_زمان‬
‫مفعول‬
ّ
‫علم_مرسوم_الخط‬
ّ
‫الخط_العثماني‬
‫اسم_المفعول‬
‫صيغة_مبني_للمجهول‬
‫أهل_الكتاب‬
‫الماضي‬
‫ضمير_منفصل‬
‫الجمع‬
‫اسم_جامد‬
‫الس َّنة_النبوية‬
ُ
‫تقطيع_العروض‬
‫سجود_التّلاوة‬
ِ
‫القبلة‬
‫فاصلة_قرآنية‬
‫م َثل_قرآني‬
َ
‫التلاوة‬
‫اسم_موصول‬
‫علم_التنزيل‬
‫رواية‬
‫رِواية_قالون‬
xxiii
Rhetoric
Root
Rubu’
Sajdah
Second person
Singular form
Standard script
Subject
Superlative noun
Surah
Surah keys
Tafssir
The Five Nouns
Third person
Thumn
Translation of Qur’an
Triptote
Verb with a simple root
Verb with augmented root
Virtues of surah
Waqf
Weakened verb

‫البلاغة‬
‫جذر_الكلمة‬
‫ر ُبع‬
ُ
‫سجدة‬
‫المخاطب‬
‫المفرد‬
‫الخط_الإ ملائي‬
‫فاعل‬
‫اسم_تفضيل‬
‫سورة‬
‫مفاتيح_السورة‬
ّ
‫علم_تفسير_القرآن‬
‫الأسماء_الخمسة‬
‫الغائب‬
‫ثُمن‬
ُ
‫ترجمة_معاني_القرآن‬
‫منصرف‬
‫فعل_مجرد‬
ّ
‫فعل_مزيد‬
‫فضائل_السورة‬
‫َوقف‬
ِ
‫فعل_نَاقص‬

xxiv
General Introduction

Work Context
Qur’an, in Arabic, means the read or the recitation. Muslim scholars define it as: the
words of Allah revealed to His Prophet Muhammad, written in Mus’haf and transmitted
by successive generations (‫[)التواتر‬Mahssin1973]. The Qur’an is also known by other
names such as: Al-Furkān , Al-kitāb , Al-dhikr , Al-wahy and Al-rōuh . It is the
sacred book of all Muslims and the first reference to Islamic law. It’s more then 14
centuries passed since its revelation, and the Muslims are still studying it, teaching it,
writing books about it and recently developing applications for it.
Qur’an is an important source of information that contains various information
about all aspects of life: Scientific, Social, Historic, Politic...etc.

Problematic
Due to the large amount of information held in the Qur’an, it has become extremely
difficult for regular search engines to successfully extract key information. For example,
When searching for a book related to English grammar, you’ll simply Google it, select
a PDF file and download it. That’s all! Search engines (like Google) are used generally
on Latin letters and for searching general information of document like content, title,
author…etc. However, searching through Qu’ranic text is a much more complicated;
It’s procedure that’s requiring a much more in depth solution as there is a lot of
information that needs to be extracted to fulfill Qur’an scholar’s needs. Before the
creation of computer, Qur’an scholars were using printed lexicons made manually. The
printed lexicons can’t help much since many search process waste the time and the
force of the searcher. Each lexicon is written to reply to a specific query which is
generally simple. Nowadays, there are many applications that are specific for search
needs; most of applications that were developed for Qur’an had the search feature but

1
General Introduction
in a simply way: sequential search with regular expressions.
The simple search using exact query does not offer better options and still inefficient
to move toward Thematic search by example. Full text search is the new approach
of search that replaced the sequential search and which is used in search engines.
Unfortunately, this approach is not applied yet on Qur’an. The question is why we
need this approach? Why search engines? Do applications of search in Qur’an really
need to be implemented as search engines?

Objectives
Our proposal is about design a retrieval system that fit the Qur’an search needs. But
to realize this objective, we must first list and classify all the search features that are
possible and helpful. Then we need to study how to implement each feature and what
is its requirements.

Report organization
We organized the report as follows:

First Part : Art State
This part contains 3 chapters:
Chapter 1 : Search Engines
To design a powerful search engine, it is essential to understand how search engines
work, in this chapter we discuss the different parts of a search engine, namely: the
crawling, indexing and querying . And the definition of basic concepts in the field of
information retrieval systems. This chapter contains an introduction to the semantic
approach.
Chapter 2 : Arabic Language
The objective of this chapter is to present the properties of the Arabic language,
its spells, its morphology and to introduce some ambiguity issues that raise due to the
Arabic nature ... etc.
Chapter 3 : The Qur’an
This chapter presents an overview of the Qur’an and its sciences, it has a historical
background on the evolution of the Qur’an, the structure of the Mus’haf, and the
main problems of computerization of the Qur’an, including the script Uthmani and
authentication Qu’ranic texts.

2
General Introduction

Second Part : Analysis & Conception
This part contains two chapters :
Chapter 4 : Qur’anic search features
The objective of this chapter is to present the possible search features in Qur’an.
It has a big importance in our work since it defines our objectives and our path on
the work. We’ve make a survey about Usefulness, Need, and Clarity of each feature in
order to validate our points of view in choosing those features.
Chapter 5 : Conception
In this chapter, we start by a preview on our previous work then we’ll propose many
improvements to carry out all the feasible search features mentioned in the previous
chapter.

Third Part : Implementation
This part contains the different steps of implementation of our retrieval system. It
includes one chapter:
Chapter 6 : Implementation
This chapter describes the choice of technologies and development tools and also
presents the prototype with a description of various features.
Finally, we finish the report with a conclusion that summarizes our work. We
include an appendix that describes the papers published about this work. Actually
there are two papers:
• An Arabic paper in NITS 2011 KSA entitled ”An Application Programming Interface for indexing and search in Noble Quran”1 [Chelli2011].
• An English paper in a pre-conference workshop in LREC 2012 Turkey which is
about ”LRE-Rel: Language Resource and Evaluation for Religious Texts”. The
paper was entitled ”Advanced Search in Quran: Classification and Proposition of
All Possible Features”[Chelli2012].

1

Arabic title:

‫مكتبة برمجية للفهرسة والبحث في القرآن الكريم‬
3
Part I
State Art

4
Chapter 1
Search engines

How could the world beat a path to your door when the path
was uncharted, uncatalogued, and could be discovered only
serendipitously?
Paul Gilster, Digital Literacy

1.1

Introduction

Our work falls within the field of Information Retrieval, as it aims to design a search
engine, in this chapter we will discuss how search engines work by explaining its main
components.
Exploration is the part that feeds the search engine by documents that it collects,
but with the amount of information that becomes larger and larger, it is necessary to
develop methods of search, only indexing able to accelerate search in very large systems
such as the Web, because it anticipates the search by extracting and arranging them
keywords.
So that search results be satisfactory, we must properly calculate the relevance of
results against the query, this is done during the interrogation. The question must also
be able to express simple questions as well as complex questions.
The quality of research is directly related to the quality of the crawling, indexing
and search, these three operations can be considered as the core of search engine, the
objective of this chapter is to define the main concepts of this area, starting with
defining the crawling, then study indexing, its methods and steps, and then we shall
explain the process of search and the notion of relevance.
5
Chapter 1

1.2

Definitions

1.2.1

Keyword

Word or set of words chosen to represent the contents of a document, and find it in
document search. It can be coming from the document (title, text, abstract, ...) or a
controlled vocabulary..[Hensens1998]

1.2.2

Descriptor

Keyword selected from a set of equivalent terms to represent clearly a concept. It is usually part of an organized and hierarchical vocabulary of type ”thesaurus”.[Hensens1998]

1.2.3

Document

A document can be text, a piece of text, web page, image, video, etc. We call Document
any unit that can be an answer to a user query. For textual documents, there are many
forms regarding their specification. A document can be a text without any structure
(it is also called full-text) and may also be a text with a structured part (document
partially structured or semi structured) or fully structured. [Amrouche2008]

1.2.4

Query

A query expresses the need for information of a user. Various types of query languages
have been proposed to formulate a query. A query can be expressed:
• In natural (or almost) language (eg: ”find all the manufacturing facilities of cars
and their addresses”)[Salton1971]
• In a structured format, also called Boolean query language (eg: ”cars and factories
and brand”)[Bourne1979]
• As graphical language from a GUI [Lelu1992]

1.2.5

Relevance

Relevance is a word that simply means returning the information considered the most
useful at the top of a result list. While the definition is simple, getting a program to
compute relevance is not a trivial task, mainly because the notion of usefulness is hard
for a machine to understand. [Bernard2009]

6
Chapter 1

1.3

Search engines

A search engine is software that allows to regain resources (web pages, images, video,
files, information ... etc.) related to any words. Some websites offer a search engine as
the main feature, called then search engine the website itself (Google, Yahoo, Bing ...
are search engines). [Nejjari2007]
A search engine is also a crawling tool on the web made up of ”robots” that explore
the websites periodically and automatically (without human intervention, that is what
distinguishes search engines from directories). They follow the links (pages that link
to each other) encountered on each page reached. Each identified page will be indexed
in a database, then will be accessible by Internet users using keywords.[Sanan2008]
Search engines do not apply only to Web: some engines are softwares installed on
personal computers. These are known as desktop search engines , they aims the search
in the files stored on the PC - include such Exalead Desktop, Google Desktop and
Copernic Desktop Search ... etc.
In December 2004, Tim Berners Lee (the inventor of the World Wide Web) talked
about a new project: ”Semantic Web” which is based on processing of the web information automatically according to their significances. The reason for this was that 80% of
Web contains texts intended to be read and understood by humans. While computer
programs, Web browsers and search engines are unable to understand this content, so
they are unable to speed up the search. In less than two years from the article of Lee,
the First foundations of Semantic Web were formed. They seemed to lead the world
toward a new revolution in Internet and search engines .[Abulhajjaj2009]
At first glance, nothing distinguishes a classic search engine from a semantic one.
The same sparse interface, with a text box in the center of the page where the user can
enter his search query. In fact, the difference lies in the search mode. A classic search
engine , as Google, works as follows: its robots index browse the pages and index the
words. Then store these words in a gigantic database. Users can do search by send
their queries and a search algorithm retrieve the results and sort them in a certain
order based on their relevance.[Mentre2008]

1.4

Full-text search

Full-text search is a technology focused on finding documents matching a set of words
.
While sounding like a mouthful, full-text search is more common than you might
think. You probably have been using full-text search today. Most of the web search

7
Chapter 1

engines such as Google and Yahoo! use full-text search engines at the heart of their
service. The differences between each of them are recipe secrets (and sometimes not
so secret), such as the Google PageRankTM algorithm. PageRankTM will modify the
importance of a given web page (result) depending on how many web pages are pointing
to it and how important each page is .
Be careful, though; these so-called web search engines are way more than the core
of full-text search: They have a web UI , they crawl the web to find new pages or
existing ones, and so on. They provide business-specific wrapping around the core of
a full- text search engine.
Given a set of words (the query), the main goal of full-text search is to provide
access to all the documents matching those words. Because sequentially scanning all
the documents to find the matching words is very inefficient, a full-text search engine
(its core) is split into two main operations: indexing the information into an efficient
format and searching the relevant information from this precomputed index. From the
definition, you can clearly see that the notion of word is at the heart of full-text search;
this is the atomic piece of information that the engine will manipulate. [Bernard2009]

1.5

Crawling

Crawling is the process by which we gather pages from the Web, in order to index
them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. the web crawler is sometimes referred to as a spider .
[Manning2009]
This process is in the phase preceding the indexing phase, see Figure:

8
Chapter 1

Figure 1.1: The various components of a web search engine

1.5.1

Crawler Features

We list the desiderata for web crawlers in two categories: features that web crawlers
must provide, followed by features they should provide. [Manning2009]
1.5.1.1 Features a crawler must provide
Robustness :

The Web contains servers that create spider traps, which are gener-

ators of web pages that mislead crawlers into getting stuck fetching an infinite number
of pages in a particular domain. Crawlers must be designed to be resilient to such
traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty
website development .
Politeness :

Web servers have both implicit and explicit policies regulating the

rate at which a crawler can visit them. These politeness policies must be respected .
1.5.1.2 Features a crawler should provide
Distributed :

The crawler should have the ability to execute in a distributed

fashion across multiple machines .
Scalable :

The crawler architecture should permit scaling up the crawl rate by

adding extra machines and bandwidth .

9
Chapter 1

Performance and efficiency :

The crawl system should make efficient use of

various system resources including processor, storage and network band- width.
Quality :

Given that a significant fraction of all web pages are of poor utility for

serving user query needs, the crawler should be biased towards fetching “useful” pages
first.
Freshness :

In many applications, the crawler should operate in continuous mode:

it should obtain fresh copies of previously fetched pages. A search engine crawler,
for instance, can thus ensure that the search engine’s index contains a fairly current
representation of each indexed web page. For such continuous crawling, a crawler
should be able to crawl a page with a frequency that approximates the rate of change
of that page.
Extensible :

Crawlers should be designed to be extensible in many ways – to

cope with new data formats, new fetch protocols, and so on. This demands that the
crawler architecture be modular.

1.6

Indexing

To make the research cost acceptable, it should pass  by an essential phase in the
document database. This phase consists in analyzing each document in the collection to create a set of keywords: we call it the indexing phase. These keywords will
be more easily used by the system during the subsequent process of search. Indexing create a representation of documents in the system. Its objective is to find the
most important concepts of the document (or query), which form the descriptor of
document.[Sauvagnat2005]

1.6.1

Definition

Indexing is the act of describing or classifying a document by index terms or other
symbols in order to indicate what the document is about, to summarize its content
or to increase its find-ability. In other words, it is about identifying and describing
the subject of documents. Indexes are constructed, separately, on three distinct levels:
terms in a document such as a book; objects in a collection such as a library; and
documents (such as books and articles) within a field of knowledge.
The process of indexing begins with any analysis of the subject of the document.
The indexer must then identify terms which appropriately identify the subject either
10
Chapter 1

by extracting words directly from the document or assigning words from a controlled
vocabulary. The terms in the index are then presented in a systematic order. Indexers
must decide how many terms to include and how specific the terms should be. Together
this gives a depth of indexing.[Lancaster2003]
Indexing is most often used to information retrieval. But it can also be used in other
areas such as automatic classification of documents,   keyword suggestion, co-occurring
terms calculating, automatic summarization, etc.[Abar2009]

Figure 1.2: Indexing Benefits

1.6.2

Indexing modes

1.6.2.1 Manual indexing
Manual indexing is achieved by a human expert (librarian or specialist in the field )
that analyzes the content of the text to identify the terms representing the document.
Manual indexing ensures greater relevance in the answers, because it identifies a
more specific keywords describing a document.
However, it has several drawbacks, there is the problem of used vocabulary and
the dependence on indexer’s knowledge on the topic, ie the same document can be
indexed in several ways (according to vision of the person who makes the indexing),
and an indexer at two different times can have two distinct terms to represent the same
concept.
The major drawback of this method is the cost in time, this method is not therefore
appropriate when the number of documents to be indexed is substantial. [Sauvagnat2005,
Abar2009, Amrouche2008]
11
Chapter 1

Manual indexing is based on four key points [Chartron1989] :
• reading the entire document for preparation ;
• consideration of descriptors, objectives (applications) and user needs;
• permanent complementarity between the terms of manual indexing and abstract;
• in the absence of appropriate descriptor, and when the emergence of a new concept is not explicit enough to propose a candidate descriptor, the ability to use
a close or generic descriptor.
So we thought fast enough to use the computer[Mustafa].
1.6.2.2 Automatic indexing
Automatic indexing is a set of automated processing phases applied on documents. We
distinguish: Tokenization (automatic extraction of word), Elimination of stop words,
Stemming (Lemmatization or radicalization), Scoring of words and finally the creation
of the index[Sauvagnat2005].
The first approach to the automatic indexation KWIC (Key Word In Context-)
was introduced by Luhn (1957)[Luhn1957]. There was discussion about to weight the
index. In the early days of information retrieval, statistical methods were based on the
frequency of words in the document. Later, this measure was extended to take into
account the specificity of a term for the document. To this end, other methods have
been exploited, such as 2-Poisson (Nie, 2003)[Gaussier2003][Mustafa].
The automatic indexing systems use several methods of analysis:
1.6.2.2.1

Linguistic analysis:

Technology issued from ”text mining”, the latter

is to implement a simplified model of linguistic theories in computer systems of learning.
This is part of the artificial intelligence field . [Allab2008]
The linguistic method consists of several modules of linguistic analysis: morphological, lexical, syntactic and pragmatic. The fact that some systems use indexing
techniques of natural language processing, demonstrates the relevance of a linguistic
approach. [Elhachani1997]
1.6.2.2.2

Statistical analysis:

The initiator of the methods of the automatic

indexation is H.P. Luhn with his influential article “The automatic creation of literature abstracts” published in 1958 in the “Journal of Research and Development” of
IBM. He states : « (...) instead of sampling at ran- dom, as a reader normally does
when scanning, the new mechanical method selects those among all the sentences of
12
Chapter 1

an article that are the most representative of pertinent information », H. P. Luhn
opened the door to work on automatic indexing by proximity also called statistical
method[Luhn1958].
Automatic indexing involves the following steps :
• Extracting words (tokenization): the extraction rules are language-dependent.
• Eliminating stop words (stop words): these are words too frequent but unnecessary. Example: the, a, of, or ... etc.
• Stemming : for example the stem of the word ”stemmers” is ”stem”.
• Transformation rules: removal of plural endings.
• Truncation: choose an optimal value of truncation of words. It is better to
truncate suffixes. There is no absolute rule for this.
1.6.2.3 Semi-automatic indexing
The two previous techniques can be combined, a first automatic process to extract
the terms of the document. However the final choice remains the expert in the field
or librarian to establish semantic relations between keywords and choose the significant terms using a thesaurus or a terminology database which is an organized list of
descriptors (keywords) obeying specific terminology rules[Abar2009, Sauvagnat2005,
Hadjhenni2008].

1.6.3

Index types

The index is the output of the indexing process, there are several types of indexes
according to the used technique and the desired function:
1.6.3.1 Document Index
The document index keeps information about each document. It is an index ISAM
(Index sequential access mode) with a fixed width, ordered by the ID of the document.
The information stored in each entry includes data, a checksum of documents and
various statistics. If the document was crawled, it also contains a pointer to a variable
width file called the document information that contains the URL and title. This
design decision was driven by the desire to have a relatively compact data structure,
and the ability to find a record in one disk traversal, when queried.[Brin1998]
The following table is a simplified illustration of a document index:

13
Chapter 1

Document ID
Document 1
Document 2
Document 3

Text
The cow says moo
The cat and the hat
The dish ran away with the spoon

Link
/ex/doc1.txt
/ex/doc2.txt
/ex/doc3.txt

Table 1.1: Document Index

1.6.3.2 Forward Index
The forward index stores a list of words for each document. The following is a simplified
form of forward index:

Document ID
Document 1
Document 2
Document 3

Words
the, cow, says, moo
the, cat, and, the, hat
the, dish, ran, away, with, the, spoon
Table 1.2: Forward Index

The rationale behind developing a forward index is that as documents are parsing,
it is better to immediately store the words per document. The delineation enables
Asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it to an inverted index. The
forward index is essentially a list of pairs consisting of a document and a word, collated
by the document. Converting the forward index to an inverted index is only a matter
of sorting the pairs by the words. In this regard, the inverted index is a word-sorted
forward index. [Brin1998]
1.6.3.3 Inverted index
Many search engines include an inverted index when evaluating a search query to
quickly retrieve documents that contain words in the query and then sort them by
relevance. Since the inverted index stores the list of documents containing each word,
the search engine can use direct access to find documents associated with each word in
a query to retrieve documents that respond quickly. The following table is a simplified
illustration of an inverted index:

14
Chapter 1

Word
the
cow
says
moo

Documents
Document 1, Document 3, Document 4, Document 5
Document 2, Document 3, Document 4
Document 5
Document 7
Table 1.3: Inverted Index

This index can identify only if a word exists in a particular document because it
does not store any information regarding the frequency or the position of the word. It
is considered an index of boolean. This index determines which documents that match
a query, but does not classify them. In some models, the index includes additional
information such as frequency of each word in each document or positions of a word
in each document. The position information allow the search algorithm to identify the
adjacent words to support the search by phrases. The frequency can be used to assist
calculating the relevance of documents to the query. [Grossman2002, Tang2004]
1.6.3.4 N-gram index
An n-gram is a sequence of n consecutive characters. For any document, all n-grams
(usually n takes the values 2 or 3) we can generate, is the result obtained by shifting a
window of n squares on the body text. This shift occurs in steps, one step corresponds
to a character. Then we calculate the frequencies of n-grams found. for example

1

[Jalam2002]the french sentence ”La nourrice nourrit le nourrisson” is represented by :

n-grams
Frequencies

1
la_
1

2
a_n
1

3
_no
3

4
nou
3

5
our
3

6
urr
3

7
rri
3

8
ric
1

9
ice
1

10
_ce
1

11
e_n
2

12
rit
1

…
…
…

Table 1.4: N-gram index

One benefit of n-grams is automatic tracking of the most common stems [Grefenstette1995]:
dans in the previous example, using techniques based on n-grams we find the common
root of : Nourrir, nourri, nourrit, nourrissez, nourriture, etc. Tolerance to spelling
mistakes and distortions is also an important property. [Sanan2008]

1.6.4

Index storage

Storage of index structures is mainly characterized per the index size and organization
of its elements. Index structures vary widely in their use of size that is closely related
1

the character ”_” is used instead of spaces, in order to facilitate the reading.
15
Chapter 1

to the organization of data in the index.
This organization has a significant impact on latency of search. More items are
closely related to each other in the storage space is less latency research, this is called
the concept of locality. It is also very important that the index can hold in main
memory, it avoids disk access to the system and reduces the latency of search.
The ideal index is one that occupies less space and minimize search latency. [Dahak2006]

1.6.5

Index update

Updating the index refers to the behavior of applying changes on the index . Changes
can be insertions, modifications or deletions. An index can be more or less able to
adapt to these changes. This adaptation can occur in two forms:
1.6.5.1 Incremental update
In the case of an incremental update, the structure of the index is updated by adding
back the indexes of new documents without modifying existing ones. The number of
changes in this case is, however, often limited.[Dahak2006]
1.6.5.2 Global update
The third case, and worst is when the structure of the entire index must be rebuilt
from scratch.[Dahak2006]

1.6.6

Indexing phases

The indexing process consists of the following phases:
1.6.6.1 Tokenization
Tokenization is a phase that may seem trivial at first, and yet provide the basis for
the rest of the indexing phases. Therefore this phase must be done with the highest
quality.[Meylan2001]
Some retrieval systems use a list of predefined keywords. This list is designed
manually and, in most cases built for a specific topic. This method allow to control
the index size. The use of automatic extraction of keywords or the use of a list of
predefined keywords, determines the type of indexing. Document-oriented in the first
case and query-oriented in the second.[Berrut1997, Dahak2006]

16
Chapter 1

1.6.6.2 Normalization
This processing is to find for a word its normalized form (usually the masculine for
nouns, infinitive for verbs, the masculine singular for adjectives, etc.). Thus, in the
index are stored only in their normalized forms, which offers a significant size saving,
but more importantly, even if the processing is done on the request, it can be much
quicker and more flexible in research: for example, if a user searches with a verb,
documents that contains this verb in all its conjugated forms will be considered, not
just documents containing the word in the form provided by the user. This step is also
called ”morphological processing of keywords”[Denoyer2004]
This phase can also be enriched with syntactic and semantic processing of keywords.
The first is to identify and group a set of words whose meaning depends on their union.
For example, the words ”White House” does not usually mean you’re dealing with a
house that is white, but instead the seat of the presidency of the United States. It is
also to remove ambiguities such as the problems of homography.
Semantic processing is intended to make distinctions between different possible
meanings of a word (polysemy). For example, this phase helps differentiate the word
”room” that can match a coin, or a room in a house. This is an arduous task that
is not currently well controlled and its effect on system performance is not always
proven.[Dahak2006]
1.6.6.3 Elimination of stop-words
This phase is of some importance since it constitutes a factor of great influence in the
accuracy of the search. The failure to remove stop words inevitably cause noise. The
elimination of stop words which are words of everyday language and do not contain
much semantic information must be both in indexing as querying (removing stop-words
from the query). [Dahak2006]
1.6.6.4 Weighting
This step is entirely dependent on the model of information retrieval used. It defines
how important a term in a given document. [Dahak2006]
In general, most term weighting formulas are built by combination of two factors. A
local weighting factor measuring the local representativity of a term in the document,
and an overall weighting factor measuring the global representativity of a term with
respect to the collection of documents[Amrouche2008].
That leads to two types :

17
Chapter 1

1.6.6.4.1

Local weighting Local weighting takes into account the local informa-

tion of the term that depend only on the document. It is typically a function of
frequency of occurrence of the word in the document, denoted tf (Term Frequency).
A term that frequently appears in a document is considered relevant to describe its
contents. [Dahak2006]
1.6.6.4.2

Overall weighting The overall weight measures the importance of a

term within all documents. It aims to represent its discriminatory nature, or in other
words its ability to distinguish between document. In fact, a term appearing in few
documents is considered more discriminatory and should be favored over a term found
in many documents. The calculation of the overall weighting is based on the number
of documents in which a term appears. One of the most used is idf (Inverse Document
Frequency), represented by the following formula:
N
Idf = log( ni )

Such as ni is the number of documents containing the word i and N is the total
number of documents.
The value tf *idf gives a good approximation of the importance of a term in the
document, particularly in the corpus of documents of similar size. [Dahak2006]

1.7

Querying

Querying is the phase of interaction between the system and the user. This expresses
the need for information via a query language that the system will take care of interpreting. This interpretation is done according to the query template and is designed to
understand user needs and express them in a formalism similar to the one used when
indexing documents. This process provides an inner query. Following this phase of
query interpreting, a matching pattern calculates the match between the inner  query
and each document in the index. This calculation established by the mapping function,
has traditionally resulted in an ordered list of documents. It should, at this level, a
semantic comparison (not equal) between concepts in of document and those of the
query.
The comparison between query and document rarely leads to strict equivalences,
but rather to partial equivalences: the document is only part of the query. The first
document in the list returned by the system is one that is considered by the system
as the most relevant, that is to say the one that best suits the query, again according
to the system. The final document is one that is considered by the system as the
18
Chapter 1

least relevant. This notion of relevance is based on the proximity between the needs
expressed by the user and the results provided by the system.[Dahak2006]

1.7.1

Relevance concept

Relevance is a central concept of the query because all evaluations are based around
this concept. But it is also the most poorly understood concept, despite numerous
studies on this concept as the one in[Denos1997].
Let us see some definitions of the relevance. Relevance is:
• The correspondence between a document and a query, a measure of informativeness of the document to the query;
• A degree of relationship (overlap, relativity, ...) between the document and the
query;
• A degree of surprise that comes with a document that is relevant to the needs of
the user;
• A measure of usefulness of the document to the user.
Even in these definitions, the used concepts (informativeness, relativity, surprise ...)
remains very vague because users have very different needs. They have very different
criteria for judging whether a document is relevant. So the notion of relevance is used
to cover a very wide range of criteria and relations[Dahak2006].

1.7.2

Similarity Function

Comparing between the document and query is equivalent to calculating a score, assumed to represent the relevance of the document in respect to the query. This value
is calculated from a function or a probability of similarity denoted rsv(q,d) (retrieval
status value), such as q is a query and d est a document and whose formula depends
entirely on the used model of information retrieval. This measure takes into account
the weight of terms in documents determined by statistical analysis and probability.
The matching function is very closely related to the operations of indexing and weighting of query terms and documents in the corpus. In general, the matching document
- query and indexing model used to characterize and identify a model of information
retrieval. The similarity function is then used to order the documents returned to
the user. The quality of this ordering is paramount. In fact, users is generally satisfied to examine the first documents (the top 10 or 20). If the documents sought
are not present in this slice, the user will consider sorting as bad in respect to his
query[Sauvagnat2005, Dahak2006].
19
Chapter 1

1.7.3

Search process

Search takes a user query and returns the effective list of matching results sorted by
relevance. Such as indexing, searching is a multiphase process, as shown in Figure.

Figure 1.3: Search process

The first operation is about building the query. Depending on the full text search,
the way to express query is either: :
1. String based—A text-based query language. Depending on the focus, such a
language can be as simple as handling words and as complex as having Boolean
operators, approximation operators, field restriction, and much more!
2. Programmatic API based—For advanced and tightly controlled queries a programmatic API is very neat. It gives the developer a flexible way to express
complex queries and decide how to expose the query flexibility to users.
Some tools will focus on the string-based query, some on the programmatic API, and
some on both.
The second operation, let’s call it analyzing, is responsible for taking sentences or
lists of words and applying the similar operation performed at indexing time (chunk
20
Chapter 1

into words, stems, or phonetic description). This is critical because the result of this
operation is the common language that indexing and searching use to talk to each other
and happens to be the one stored in the index. If the same set of operations is not
applied, the search won’t find the indexed words—not so useful!
Based on the common language between indexing and searching, the third operation
(finding documents) will read the index and retrieve the index information associated
with each matching word. Remember, for each word, the index could store the list
of matching documents, the frequency, the word positions in a document, and so on.
The implicit deal here is that the document itself is not loaded, and that’s one of the
reasons why full-text search is efficient: The document does not have to be loaded
to know whether it matches or not. The next operation (filtering and ordering) will
process the information retrieved from the index and build the list of documents (or
more precisely, handlers to docu- ments). From the information available (matching
documents per word, word fre- quency, and word position), the search engine is able
to exclude documents from the matching list. More important, it is able to compute
a score for each document. The higher its score, the higher a document will be in the
result list. let’s have a look at some factors influencing its value :
• In a query involving multiple words, the closer they are in a document, the higher
the rank.
• In a query involving multiple words, the more are found in a single document,
the higher the rank.
• The higher the frequency of a matching word in a document, the higher the rank.

• The less approximate a word, the higher the rank.
Depending on how the query is expressed and how the product computes score, these
rules may or may not apply. This list is here to give you a feeling of what may affect
the score, therefore the relevance of a document. Once the ordered list of documents
is ready, the full-text search engine exposes the results to the user. It can be through
a programmatic API or through a web page. the following figure shows a result page
from the Google search engine.[Bernard2009]

21
Chapter 1

Figure 1.4: Search results returned as a web page: one of the possible ways to expose results

1.8

Semantic Approach

Semantic search seeks to improve search accuracy by understanding searcher intent and
the contextual meaning of terms as they appear in the searchable dataspace, whether
on the Web or within a closed system, to generate more relevant results. Semantic
search systems consider various points including context of search, location, intent,
variation of words, synonyms, generalized and specialized queries, concept matching
and natural language queries to provide relevant search results[Web-Techulator]. Major
web search engines like Google and Bing incorporate some elements of semantic search.
Rather than using ranking algorithms such as Google’s PageRank to predict relevancy, semantic search uses semantics, or the science of meaning in language, to
produce highly relevant search results. In most cases, the goal is to deliver the information queried by a user rather than have a user sort through a list of loosely
related keyword results. However, Google itself has subsequently also announced its
own Semantic Search project[Web-WSJ].
Other authors primarily regard semantic search as a set of techniques for retrieving
knowledge from richly structured data sources like ontologies. Such technologies enable
the formal articulation of domain knowledge at a high level of expressiveness and could
enable the user to specify his intent in more detail at query time[Web-ESWC2012].
Semantic search does not just mean contextual search or search based on the intend of the question. It include several other factors as well. A smart search engine
would consider several factors to provide the most relevant and useful search queries,

22
Chapter 1

including[Web-Techulator]:
• Current trend: If the president election was just finished in the country and
someone is searching for ’Who is the new president’, the semantic search system
should be able to understand the query and give relevant results based on the
current trend and news.
• Location of search: If a person is searching for ’what is the temperature’, the
semantic search engine should be able to provide results based on the current
location of the search. If the person is searching from California, search results
should include the current temperature in California.
• Intend of the search: Semantic search engines should be able to give appropriate search results based on the intent of the search and not based on the specific
words used in the search query.
• Variations of words in Semantic Search: Semantic search should consider
tenses, plural, singular etc and provide relevant search results for all semantic
variations of the words. For example, words like dog, dogs, dog’s etc.
• Synonyms and Semantic Search: A semantic search engine should be able
to understand the synonyms and give more or less the same search results on any
synonyms of the word users search for. For example, try searching for ”biggest
mountain” and ”highest mountain”. You would get pretty much the same results
since both of them means the same in this particular query, even though the
”biggest” and ”highest” could mean different things in different cases.
• Generalized and Specialized queries: Semantic Searching engine should be
able to set relation between generalized and specialized queries and provide appropriate and relevant results. For example, consider an article on general health
topics and another article specifically on Diabetes. If someone search for health
information, both articles could match even though the article on Diabetes does
not talk specifically about ”health”.
• Concept matching: This is a sub-set of context matching in semantic search.
Semantic search should understand the broad concept of the query and return
relevant results. For example, a query on ”Traffic problems in New Jersey” could
return relevant results including the topics ”narrow roads”, ”non functioning
traffic lights”, ”lack of roadside assistance” etc because in a broad conceptual
point of view, all of these lead to traffic problems.

23
Chapter 1

• Natural language queries: Not everyone are tech savvy and not many people
know what to search to get the relevant search results. Most users simply type
in queries in natural language. For example, if some one want to find what is
the current time in Arizona, USA, they would search for ’What time is it in
Arizona’. Most search engines would simply show results from the websites and
articles that talk about Time and Arizona. However, smart search engines that
use Semantic Search would actually show you the current time in Arizona, USA.
Try it yourself at Google search.
• Change of meaning based on the group of words: By combining different
words, the true meaning of search term could change. Consider the following
search terms:
◦ New egg health products
◦ New egg health benefits
If you search for both the above terms in Google, you would get completely
different meaning. Instead of just picking the results based on the words, Google
Search looks at it as a term and then combines with common user search pattern.
The first term returns search results primarily on the popular online shopping
website NewEgg.com and shows results of health products from that site and
similar sites. The second term shows search results for the health benefits of
Egg.
Semantic Search is a big challenge for search engines and none of them are perfect.
Most search engines have improved significantly in last few years. Search engines like
Bing and Google provide significantly relevant search results incorporating some degree
of semantic search. There are many other specialized search engines (like Hakia) which
offer purely semantic search results, but they lack many other qualities of normal search
engines.

1.9

Conclusion

In this chapter, the study focused on the working mechanism of search engines and
information retrieval systems, based on indexing due to its importance. Indeed, it is
the most important step in the search process as it allows the extraction and processing
of keywords.
The search phase does not offer only the interaction between users and the system,
but also calculates the match percentage between the query and the documents to
provide the most relevant results.
24
Chapter 2
Arabic Language

‫أَنا البحر في أَحشائ ِِه الدر كامـــن *** فهل سأَلوا الغواص عن صدَفاتـــي‬
َ َ َ َّ
ٌ ِ ُّ ُ
ُ َ
َ ََ
‫حافظ ابراهيم، قصيدة عن اللغة العربية‬

2.1

Introduction

Arabic (

‫ ) العربية‬is a name applied to the descendants of the classical Arabic language

of the sixth century AD, the most widely used in the Qur’an, the Islamic holy book.
Arabic is a Central Semitic language, closely related to modern Hebrew and Aramai
languages1 .
In this chapter, we will talk about orthography and morphology of the Arabic
language that are unique. we will also talk about some ambiguities that may appear
in Arabic due to the absence of vocalization.

2.2

Orthography

The Arabic script is one of the most used scripts all over the world. It dominates in
the Arab countries, of course, but holds a special place for all Muslims because it is
the script used to write the Qur’an.[Jabri]
It is written from right to left like other Semitic languages. Its alphabet has twentynine2 consonant letters, three of them .

‫يوا‬

are considered as vowels. Optionally,

1

Modern Aramaic, languages are varieties of Aramaic that are spoken vernaculars in the medieval
to modern era, evolving out of Middle Aramaic dialects around AD 1200.
2
The scientists of Arabic language considered the Hamzah as a letter
25
Chapter 2

one of the three Diacritical marks .

‫ـِ ُـ َـ‬

can be placed after certain characters to

resolve the ambiguity in pronunciation and/or direction when it arises. In a fully
vocalized Arabic text, the lack of diacritics can be regarded as a sokōune .

‫ْـ‬

(silence).

In some cases, a letter doubled, may be replaced by a single letter with tashdeed .
(reinforcement) placed above. [AlKharashi1999]

ّ‫ـ‬

In addition, it is important to note that the notion of uppercase and lowercase
letter does not exist, the Arabic writing is called unicameral. In addition, Arabic is
a language semi cursive, most letters are attached to each other, their spellings differ
depending on whether they are preceded and/or followed by other letters or they are
isolated. Only six of them does not attach to the following letter : .
one letter does not attach at all : .

‫ء‬

‫وزرذدا‬

and

[Mesfar2008]

Letter
Hamzah
Wâw
Ayn

Spellings

‫ء‬
‫و ـو‬
‫ـع ـعـ عـ ع‬

Table 2.1: 3 types of Arabic letters: 1 form, 2 forms or 4 forms

2.3

Lexicography

The traditional Arabic grammar has only three subsets: Nouns, Verbs and Particles.

2.3.1

Verbs

A verb is an entity expressing a time-dependent sense. Most Arabic verbs are formed

‫ك َتب‬
َ َ
and eventually four consonants that is the case of the verb . ‫دحرج‬
َ َْ َ

on three radical consonants that is the case of the verb .

(kataba – write)
(dahrağa – roll

along). These roots may form several patterns as a result of one or more morphological
transformations (eg: repetition of a consonant, lengthening a vowel, the  expanding of
a morpheme, etc.), it comes in this case to roots with augmented pattern.
Several linguistic studies have been conducted on the verbal system in Arabic,
see[Larcher2003]. In this section, it is necessary to introduce a classification of verbs
according to their radicals:
2.3.1.1 Verbs with a simple root (‫المجرد‬
ّ

‫:)الفعل‬

A verb with a simple root has a base of three consonants called radical consonants.
These verbs are associated with verbal pattern .

‫فعل‬
َ ََ

(fa’ala). When none of the root

26
Chapter 2

consonants of the verb is a long vowel, it is called healthy. These radicals may involve

ِ
processing or causes of defects (‫ ,)علَّة‬we mention :
• The presence of .

ٔ‫ا‬

(á – hamzä), .

‫ي‬

(y - yâ’) or .

‫و‬

(w – wâw) among the

radical consonants. Depending on the position of that, we distinguish different
types of verbs :
◦ If one of the root consonants is .
= Hamzated verb (‫مهموز‬
َْ

ٔ‫ا‬

(á – hamzä), independently of its position

‫; )فعل‬

‫و‬

◦ The first radical consonant is a .
(‫مثال‬
َِ

‫;)فعل‬

◦ The second radical consonant is a .

‫;)أَج َوف‬
ْ

◦ The third radical consonant is a .

ِ
‫;)نَاقص‬

‫و‬

(w) or .

‫ي‬

(y) = Assimilated verb

(w) or .

‫ي‬

(y) = Hollow verb (‫فعل‬

‫و‬

(w) or .

‫ي‬

(y) = Weakened verb (‫فعل‬

• The presence of two identical consonants in the second and third position of the
root = Geminated verb (‫مضاعف‬
َ َ ُ

‫.)فعل‬

2.3.1.2 Verbs with augmented root (‫المزيد‬

‫:)الفعل‬

The patterns of verbs with augmented root are formed from simple roots by a set of
morphological operations to provide a specific meaning to the outcome verbs , we
mention:
• .
• .
• .
• .
• .
• .
• .
• .

‫فعل‬
َّ َ
‫فاعل‬
ََ َ
‫أَفعل‬
َ َْ

(fa’ala)
(faā’ala)
(áfa’ala)

‫( َتفعل‬tafa”ala)
َ َّ َ
‫( َتفَاعل‬tafaā’ala)
ََ
‫( ا ِْف َتعل‬ìfta’ala)
َ َ

‫( اِنْفعل‬ìnfa’ala)
َ ََ
‫( اِس َتفعل‬ìstaf’ala)
َ َْ ْ

27
Chapter 2

2.3.2

Nouns

The morphological system of Arabic nouns contains three subcategories:
2.3.2.1 Primitive nouns (‫الجامدة‬

‫:)الأسماء‬

The primitive nouns are nouns that can not be attached to a verbal root. They well
form the fundamental glossary of the concrete language. eg: .
.

‫كُ ْرسي‬
ّ ِ

(kursiyy – chair), .

َ
‫ك ْبش‬

‫أخ‬

(raás – head),

(kabš – Sheep), etc. In this category we also include

nouns composed of two letters such as: .
(áab – father), .

‫رأْس‬
َ

‫دم‬

(dam - blood), .

‫فم‬

(fam - mouth), .

‫أب‬

(áh – brother), etc.

ّ
2.3.2.2 Nouns derived from verbals (‫المشتقة‬

‫: )الأسماء‬

These are the nouns that can be derived from a verbal root. The number and nature of
these forms vary depending on the status of the verb to which they relate. As nouns,
they can receive marks of case, gender and indeterminacy.
2.3.2.3 Numbers :

‫صفر‬
‫عشرون‬

This category of nouns is made up of simple numerals representing units: from .
(sifr- zero, 0) to .

‫تسعة‬

(tisat – nine, 9); the tens: .

(išruwn – twenty, 20) … and .

‫تسعون‬

‫تسعة_عشَ ر‬
َ

(ašarat – ten, 10), .

(tisuwn – ninety, 90) ; the hundreds, etc. well

as numerals compounds such as cardinals of .
to .

‫عشرة‬

(tisat ašara - nineteen, 19).

‫أَحد_عشَ ر‬
َ َ

(áahada ašara - eleven, 11)

In their decomposition, the Arab grammarians have classified adjectives to nouns
as they almost take all the morphological forms and may, for example, be definite or
indefinite and flex according to case, number and type.
2.3.2.4 Demonstrative pronouns (‫الإ شارة‬

‫:)أسماء‬

Demonstrative pronouns represent a subcategory of noun expressing an idea of demonstration. They can indicate that the object represented is found, either in the text,
either in space or time, defined by the situation of utterance. They are two subsets:
near-deictic (eg: .
.

َ‫ذلِك‬

‫هذا‬
َ

(dalika – that), .

(hadaā – this), .

َ‫أُولائِك‬

‫هؤلاء‬
َُ

(haẃulaā’ – these) and far-deictic (eg:

(ùuwlaāýika – those), etc.). Demonstratives are deriv-

able only to dual.

28
Chapter 2

2.3.2.5 Relative pronouns (‫موصولة‬

‫:) ٔاسماء‬

Relative pronouns relate to the noun or personal pronoun that precedes them and that
we denote by antecedent. The relatives shall afford with their antecedents but are
derivable only to dual (as demonstratives). Among the relative pronouns, we mention:
.

‫الَّذي‬

dual), .

(al-ladiy - that, masculine, singular), .

‫اللذين‬
َ

(those, masculine, plural), etc.

2.3.2.6 Personal pronouns (

ِ‫الل َت ْين‬

(al-latayni – those, feminine,

‫:) الضمائر المنفصلة‬

Personal pronouns are intended to identify three types of grammatical persons:
• First person, ie, the speaker (‫ , )المتكلم‬that who is talking: .
.

‫نَحن‬
ُ ْ

‫أَنَا‬

(ánaā - I) or

(nahnu – we) ;

‫( أَنْت‬áanta –
َ
ِ ْ‫( أَن‬áanti – you, feminine, singular), . ‫( أَنْ ُتما‬áanyou, masculine, singular), . ‫ت‬
َ
tumaā – you, dual), . ‫( أَنْتم‬áantum – you, masculine, plural), . ‫( أَنْتن‬áantunna
ُْ
َّ ُ

• Second person, ie, the listener (‫ ,)المخاطب‬that who talking to: .

– you, feminine, plural);

• Third person, ie, the absent (‫ , )الغائب‬that who talking about: .
.
.

2.3.3

ِ
‫هي‬
َ
‫هن‬
ُ

(hiya – she), .

‫هما‬
َُ

(humaā – they, dual), .

(hunna – they, feminine).

‫هم‬
ُْ

‫ه َو‬
ُ

(huwa – he),

(hum – they, masculine),

Function words

The function words are used to locate entities, facts or objects in relation to time or
place. They also play a key role in the coherence and sequencing of a text. For example,
we have particles that designate a time:
• .

‫بعد‬

(bada – after)

• .

‫قبل‬

(qabla – before)

• .

‫منذ‬

(mundu – since)

or a place like
• .

‫حيث‬

(haytu – where),

According to their semantic meaning and their function in the sentence, they can
play an important role in the interpretation of a sentence expressing an introduction,
explanation, consequence, etc.[Kadri1992]. Function words include various categories,
we mention:
29
Chapter 2

• Prepositions : .

ِ
‫في‬

(fiy – in) or .

• Conjunctions: .

‫ثُم‬
َّ

(tumma – then) ;

• Adverbs : .

‫أَ َبدًا‬

‫ع َلى‬
َ

(abadã – never) or .

in the normal way) ;
• Quantifiers: .

‫كل‬
ُّ

(kulla – all ) or .

(alaą – on);

ِ َ ْ
‫بِشكل_عادي‬

‫َبعض‬
ْ

(bišaklĩ aādiyyĩ – normally,

(bada – some) ;

• Etc.
The function words are divided into subgroups: those variables (quantifiers) and those
that are invariable (adverbs, prepositions, etc.).

2.4

Morphology

There are several categories of Fusional languages, and Arabic is precisely in the
category of languages with Intro-flexion: this category of languages, the consonants
indicate the meaning and vowels mark the flexion of  word. This system is found
especially in the Semitic languages (eg: Arabic, Hebrew) [Choueiter2006]
Morphologically, the Arabic language is very rich and based on the structure of
patterns and roots. Most Arabic words are generated from a finite set of roots (about
7000 roots) transformed using one or more patterns (about 400-500). Theoretically, a
single Arabic root can generate hundreds of words (noun, verb, ...). An Arabic word
can exist in about a hundred of forms in a normal text by adding certain suffixes and
prefixes (mainly considered as stop-words in English).[AlKharashi1999]

2.4.1

Flexional Morphology

Arabic uses for the declension of verbs and nouns, some indications of aspect, mood,
time, person, gender, number and case, which are generally suffixes and prefixes[Gaudefroy1975].
Generally, these flexional marks can distinguish [?] :
• Mode of verbs: eg, for the verb .

‫ذهب‬
َ َ

(dahaba – to go), forms in the Perfective

‫( ذه ْبت‬dahabatu – I went) or
ُ َ
their prefixes such in the Imperfective (‫ )المضارع‬as . ‫( أَذهب‬áadhabu – I go) ;
ُ َ
ِ ُ َ
Function of nouns: using of suffixes such as . ‫( رجلان‬rağulaáni – Two men) in
Nominative (‫ )حالة الرفع‬or . ِ‫( رجلين‬rağulayni – Two men) in Accusative (‫حالة‬
َْ ُ َ
‫ )النصب‬or Genitive (‫]?[ .)حالة الجر‬
(‫ )الماضي‬can be identified using their suffixes as .

•

30
Chapter 2

2.4.1.1 Flexion of verbs
Called also Conjugation, it describes the variation in their forms according to circumstances. Generally, conjugation includes a number of values which are:
• Aspect: The aspect is a grammar feature associated, in most cases, to verbs in
order to indicate which state it expresses; considered from the perspective of its
development (beginning, progress, completion, overall evolution, etc.), regardless
of when it comes ;
• Mood : Mood indicates how the action expressed by the verb is designed and
presented. The action can be doubted, affirmed as actual or eventual. They
combine the semantics of verbs and thereby create aspects ;
• Tense : Tense is a grammatical feature to locate a fact (which may be a state or
action) in the enunciation time axis relative to the three markers: past, present
and future. The temporal indications are often accompanied by aspectual indications that are more or less related.
These three key values are closely related ; they can describe two basic forms of the
verb in Arabic :
• Perfective (‫ :)الماضي‬it indicates that the progress of the action expressed by
the verb is finished, which means the past. It is characterized by adding suffixes
of person, gender, number and mood to the verb’s stem. For example, for the
feminine plural of the verb .
get the form .

‫ك َت ْبن‬
َ َ

‫ك َتب‬
َ َ

(kataba – to write), we add the suffix .

‫ن‬
َ

to

(katabna - *they* wrote, feminine) and for the masculine

plural, we add the suffix .

‫وا‬

masculine) ;

to get the form .

َ
‫ك َت ُبوا‬

(katabuwá - *they* wrote,

• Imperfective (‫ :)المضارع‬it indicates an unfinished progress, which may imply
the present. It is characterized by adding a prefix and one or more infixes as a
letter duplication or a vowel substitution. For example, for the verb .
– to give), we can get .

ُّ ُ
‫أَمد‬

(áamuddu – I give) or .

‫َيمددن‬
ُْ ْ

َّ َ
‫مد‬

(madda

(yamdudna – they

gives, feminine). It includes two types of modal inflections:

◦ The indicative of actual mode where the speaker states the actual character
(reread, to be achieved, in progress, etc.) of action or state expressed by the
verb;
◦ The subjunctive of potential mode in which the speaker merely states the
possible or virtual nature of action or state expressed by the verb.
31
Chapter 2

• Imperative (‫ :)الأمر‬it expresses the order, command, or exhortation ... etc. It
exists only the with the 2nd person in singular, dual and plural;
2.4.1.2 Flexion of nouns
In Arabic, the declension (‫ )إعراب‬of nouns involves three cases: Nominative (‫,)مرفُوع‬
َْ
Accusative (‫ )منصوب‬and Genitive (‫ .)مجرور‬Except for some special cases, the nouns are
ْ
ُ ْ
declinables (‫ )معربة‬and appear in one of these three cases according to their functions
in the sentence. In terms of the spelling, the case represents only an assistant graphic
at the end of nominal forms. The nominal system of Arabic admits different systems
depending on the nature of variation of the form (triptote, diptote, etc.) and the number
thereof (singular, dual or plural). We can distinguish :
2.4.1.2.1 Declension of singular nouns:
Basic declension of triptotes (‫ :)منصرف‬This is the most frequent case, it takes
the vowel .

‫ضمة‬
َّ َ

(dammat – u) as a sign of the nominative , the vowel .

a) in the accusative and the vowel .

‫كَس َرة‬
ْ

.

‫ة‬

ٍ‫ـ‬

(ta) or by .

(fathat –

(kasrat – i) in the genitive. When the noun is

undefined, the tanwîn is marked respectively by the three diacritics: .
(ã - an) et .

‫ف ْتحة‬
َ َ

ٌ‫ـ‬

(ũ – un), .

‫ًـ‬

(ĩ – in). In the indefinite accusative , except the case of nouns ending by

‫ا‬

(ā’), an álif .

‫ا‬

(ā) strengthens the tanwîn .

the accusative indefinite the noun .

ِ
‫ك َتاب‬

(an) : for example, in

(kitaāb – book) becomes .

book, accusative, indefinite) and the book .
(ğaziyratã – island, accusative, indefinite).
Declension of diptotes(‫الصرف‬

‫ًـ‬

‫جزِيرة‬
َ َ

ِ
‫ك َتا ًبا‬

(kitaābãā –

(ğaziyrat – island) becomes .

‫:)الممنوع من‬

‫جزِيرة‬
ًَ َ

The nouns that are diptotes, gram-

matically undefined, do not accept tanwîn and take the same mark in the accusative
and the genitive which is the .

‫ف ْتحة‬
َ َ

(fathat – a). By contrary, when they are defined,

they follow the declension of triptotes. This is the case of feminine nouns that end

‫( اء‬ā’) such as . ‫( صح َراء‬sahraā’ – desert), masculine adjectives of colors with
ْ
ْ
the pattern . ‫( أَفعل‬afal) such as . ‫( أحمر‬áhmar – red) and those which are feminine
َ ْ
with the pattern . ‫( فعلَاء‬falaā’) such as . ‫( بيضاء‬baydaā’ – white , feminine)
َ َْ
َْ
with

Declension of The Five Nouns (‫الخمسة‬
• Three nouns: .

‫ٔابو‬

(áabuw - father), .

‫ٔاخو‬

‫:) الأسماء‬

The five nouns are :

(áhuw - brother) and .

‫حمو‬

(hamuw

- stepfather) ;
• A variant of .

َ
‫فم‬

(fam – mouth) : .

َ
‫فا‬

,.

‫فو‬

and .

ِ
‫في‬

;

32
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an
Proposal of an Advanced Retrieval System for Noble Qur’an

More Related Content

What's hot

Demand for Army JTF Headquarters Over Time
Demand for Army JTF Headquarters Over TimeDemand for Army JTF Headquarters Over Time
Demand for Army JTF Headquarters Over TimeBonds Tim
 
REPORT on OUTREACH PROGRAMME: FEB – MARCH 2017
REPORT on OUTREACH PROGRAMME: FEB – MARCH 2017REPORT on OUTREACH PROGRAMME: FEB – MARCH 2017
REPORT on OUTREACH PROGRAMME: FEB – MARCH 2017Dilip Barad
 
Operational risk
Operational riskOperational risk
Operational riskswati raj
 
Air Force Enhancing Performance Under Stress
Air Force Enhancing Performance Under StressAir Force Enhancing Performance Under Stress
Air Force Enhancing Performance Under StressJA Larson
 
Thinking activity
Thinking activity Thinking activity
Thinking activity JetalDhapa
 
Wireless Geolocation
Wireless GeolocationWireless Geolocation
Wireless GeolocationFatema Zohora
 
School library management system software
School library management system softwareSchool library management system software
School library management system softwareRanganath Shivaram
 
ZSSS_End of Project Evaluation Report
ZSSS_End of Project Evaluation ReportZSSS_End of Project Evaluation Report
ZSSS_End of Project Evaluation ReportClaudios Hakuna
 
1263_protecting_education_in_the_mena_region_-_full_report_-_june_2016__web_v...
1263_protecting_education_in_the_mena_region_-_full_report_-_june_2016__web_v...1263_protecting_education_in_the_mena_region_-_full_report_-_june_2016__web_v...
1263_protecting_education_in_the_mena_region_-_full_report_-_june_2016__web_v...Shaza Salamoni
 
dissertation (Dr. Myo NA)
dissertation (Dr. Myo NA)dissertation (Dr. Myo NA)
dissertation (Dr. Myo NA)Myo Aye
 
Gallaghers' i2 guidebook_abridged.
Gallaghers' i2 guidebook_abridged.Gallaghers' i2 guidebook_abridged.
Gallaghers' i2 guidebook_abridged.James Gallagher
 
Gallaghers' i2 guidebook
Gallaghers' i2 guidebookGallaghers' i2 guidebook
Gallaghers' i2 guidebookJames Gallagher
 
HSK 9 Chinese Vocabulary List V2020 Sample
HSK 9 Chinese Vocabulary List V2020 SampleHSK 9 Chinese Vocabulary List V2020 Sample
HSK 9 Chinese Vocabulary List V2020 SampleLEGOO MANDARIN
 
Barriers to civility paper
Barriers to civility paperBarriers to civility paper
Barriers to civility paperJenny Erkfitz
 

What's hot (20)

Demand for Army JTF Headquarters Over Time
Demand for Army JTF Headquarters Over TimeDemand for Army JTF Headquarters Over Time
Demand for Army JTF Headquarters Over Time
 
Slima thesis carnegie mellon ver march 2001
Slima thesis carnegie mellon ver march 2001Slima thesis carnegie mellon ver march 2001
Slima thesis carnegie mellon ver march 2001
 
Spe 111 physical education manual
Spe 111 physical education manualSpe 111 physical education manual
Spe 111 physical education manual
 
Thesis
ThesisThesis
Thesis
 
Modbuspollmanual
ModbuspollmanualModbuspollmanual
Modbuspollmanual
 
REPORT on OUTREACH PROGRAMME: FEB – MARCH 2017
REPORT on OUTREACH PROGRAMME: FEB – MARCH 2017REPORT on OUTREACH PROGRAMME: FEB – MARCH 2017
REPORT on OUTREACH PROGRAMME: FEB – MARCH 2017
 
Operational risk
Operational riskOperational risk
Operational risk
 
Air Force Enhancing Performance Under Stress
Air Force Enhancing Performance Under StressAir Force Enhancing Performance Under Stress
Air Force Enhancing Performance Under Stress
 
Thinking activity
Thinking activity Thinking activity
Thinking activity
 
Wireless Geolocation
Wireless GeolocationWireless Geolocation
Wireless Geolocation
 
School library management system software
School library management system softwareSchool library management system software
School library management system software
 
UNFPA CP2 Evaluation report
UNFPA CP2 Evaluation reportUNFPA CP2 Evaluation report
UNFPA CP2 Evaluation report
 
ZSSS_End of Project Evaluation Report
ZSSS_End of Project Evaluation ReportZSSS_End of Project Evaluation Report
ZSSS_End of Project Evaluation Report
 
1263_protecting_education_in_the_mena_region_-_full_report_-_june_2016__web_v...
1263_protecting_education_in_the_mena_region_-_full_report_-_june_2016__web_v...1263_protecting_education_in_the_mena_region_-_full_report_-_june_2016__web_v...
1263_protecting_education_in_the_mena_region_-_full_report_-_june_2016__web_v...
 
dissertation (Dr. Myo NA)
dissertation (Dr. Myo NA)dissertation (Dr. Myo NA)
dissertation (Dr. Myo NA)
 
Gallaghers' i2 guidebook_abridged.
Gallaghers' i2 guidebook_abridged.Gallaghers' i2 guidebook_abridged.
Gallaghers' i2 guidebook_abridged.
 
Gallaghers' i2 guidebook
Gallaghers' i2 guidebookGallaghers' i2 guidebook
Gallaghers' i2 guidebook
 
HSK 9 Chinese Vocabulary List V2020 Sample
HSK 9 Chinese Vocabulary List V2020 SampleHSK 9 Chinese Vocabulary List V2020 Sample
HSK 9 Chinese Vocabulary List V2020 Sample
 
Thesis Index
Thesis IndexThesis Index
Thesis Index
 
Barriers to civility paper
Barriers to civility paperBarriers to civility paper
Barriers to civility paper
 

Similar to Proposal of an Advanced Retrieval System for Noble Qur’an

Composition of Semantic Geo Services
Composition of Semantic Geo ServicesComposition of Semantic Geo Services
Composition of Semantic Geo ServicesFelipe Diniz
 
Specification of the Linked Media Layer
Specification of the Linked Media LayerSpecification of the Linked Media Layer
Specification of the Linked Media LayerLinkedTV
 
RAPID RURAL APPRAISAL (RRA) AND PARTICIPATORY RURAL APPRAISAL (PRE) - A MANUA...
RAPID RURAL APPRAISAL (RRA) AND PARTICIPATORY RURAL APPRAISAL (PRE) - A MANUA...RAPID RURAL APPRAISAL (RRA) AND PARTICIPATORY RURAL APPRAISAL (PRE) - A MANUA...
RAPID RURAL APPRAISAL (RRA) AND PARTICIPATORY RURAL APPRAISAL (PRE) - A MANUA...Ayda.N Mazlan
 
An Introduction To Text-To-Speech Synthesis
An Introduction To Text-To-Speech SynthesisAn Introduction To Text-To-Speech Synthesis
An Introduction To Text-To-Speech SynthesisClaudia Acosta
 
Uni cambridge
Uni cambridgeUni cambridge
Uni cambridgeN/A
 
Applying The Rapid Serial Visual Presentation Technique To Small Screens
Applying The Rapid Serial Visual Presentation Technique To Small ScreensApplying The Rapid Serial Visual Presentation Technique To Small Screens
Applying The Rapid Serial Visual Presentation Technique To Small ScreensMonica Waters
 
Modelling Time in Computation (Dynamic Systems)
Modelling Time in Computation (Dynamic Systems)Modelling Time in Computation (Dynamic Systems)
Modelling Time in Computation (Dynamic Systems)M Reza Rahmati
 
optimization and preparation processes.pdf
optimization and preparation processes.pdfoptimization and preparation processes.pdf
optimization and preparation processes.pdfThanhNguyenVan84
 
Techniques_of_Variational_Analysis.pdf
Techniques_of_Variational_Analysis.pdfTechniques_of_Variational_Analysis.pdf
Techniques_of_Variational_Analysis.pdfNguyenTanBinh4
 
Ibm watson analytics
Ibm watson analyticsIbm watson analytics
Ibm watson analyticsLeon Henry
 
Learn python the right way
Learn python the right wayLearn python the right way
Learn python the right wayDianaLaCruz2
 
Protege owl tutorialp3_v1_0
Protege owl tutorialp3_v1_0Protege owl tutorialp3_v1_0
Protege owl tutorialp3_v1_0Patricia Silva
 

Similar to Proposal of an Advanced Retrieval System for Noble Qur’an (20)

Composition of Semantic Geo Services
Composition of Semantic Geo ServicesComposition of Semantic Geo Services
Composition of Semantic Geo Services
 
Sanskrit Parser Report
Sanskrit Parser ReportSanskrit Parser Report
Sanskrit Parser Report
 
Specification of the Linked Media Layer
Specification of the Linked Media LayerSpecification of the Linked Media Layer
Specification of the Linked Media Layer
 
cs-2002-01
cs-2002-01cs-2002-01
cs-2002-01
 
Fraser_William
Fraser_WilliamFraser_William
Fraser_William
 
RAPID RURAL APPRAISAL (RRA) AND PARTICIPATORY RURAL APPRAISAL (PRE) - A MANUA...
RAPID RURAL APPRAISAL (RRA) AND PARTICIPATORY RURAL APPRAISAL (PRE) - A MANUA...RAPID RURAL APPRAISAL (RRA) AND PARTICIPATORY RURAL APPRAISAL (PRE) - A MANUA...
RAPID RURAL APPRAISAL (RRA) AND PARTICIPATORY RURAL APPRAISAL (PRE) - A MANUA...
 
An Introduction To Text-To-Speech Synthesis
An Introduction To Text-To-Speech SynthesisAn Introduction To Text-To-Speech Synthesis
An Introduction To Text-To-Speech Synthesis
 
Uni cambridge
Uni cambridgeUni cambridge
Uni cambridge
 
Applying The Rapid Serial Visual Presentation Technique To Small Screens
Applying The Rapid Serial Visual Presentation Technique To Small ScreensApplying The Rapid Serial Visual Presentation Technique To Small Screens
Applying The Rapid Serial Visual Presentation Technique To Small Screens
 
Modelling Time in Computation (Dynamic Systems)
Modelling Time in Computation (Dynamic Systems)Modelling Time in Computation (Dynamic Systems)
Modelling Time in Computation (Dynamic Systems)
 
dissertation
dissertationdissertation
dissertation
 
optimization and preparation processes.pdf
optimization and preparation processes.pdfoptimization and preparation processes.pdf
optimization and preparation processes.pdf
 
Perltut
PerltutPerltut
Perltut
 
Techniques_of_Variational_Analysis.pdf
Techniques_of_Variational_Analysis.pdfTechniques_of_Variational_Analysis.pdf
Techniques_of_Variational_Analysis.pdf
 
Lesson 1...Guide
Lesson 1...GuideLesson 1...Guide
Lesson 1...Guide
 
Ibm watson analytics
Ibm watson analyticsIbm watson analytics
Ibm watson analytics
 
IBM Watson Content Analytics Redbook
IBM Watson Content Analytics RedbookIBM Watson Content Analytics Redbook
IBM Watson Content Analytics Redbook
 
2013McGinnissPhD
2013McGinnissPhD2013McGinnissPhD
2013McGinnissPhD
 
Learn python the right way
Learn python the right wayLearn python the right way
Learn python the right way
 
Protege owl tutorialp3_v1_0
Protege owl tutorialp3_v1_0Protege owl tutorialp3_v1_0
Protege owl tutorialp3_v1_0
 

More from Assem CHELLI

How to get in GSoC , DevFest Algiers 2018
How to get in GSoC , DevFest Algiers  2018How to get in GSoC , DevFest Algiers  2018
How to get in GSoC , DevFest Algiers 2018Assem CHELLI
 
Dev environment for linux (Mainly KDE and python)
Dev environment for linux  (Mainly KDE and python)Dev environment for linux  (Mainly KDE and python)
Dev environment for linux (Mainly KDE and python)Assem CHELLI
 
تجربتي مع المساهمة في المشاريع الحرة - اليوم الحر
تجربتي مع المساهمة  في المشاريع الحرة - اليوم الحر تجربتي مع المساهمة  في المشاريع الحرة - اليوم الحر
تجربتي مع المساهمة في المشاريع الحرة - اليوم الحر Assem CHELLI
 
Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending
Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending  Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending
Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending Assem CHELLI
 
Global Schema for Alfanous Quran Search Engine
Global Schema for Alfanous Quran Search EngineGlobal Schema for Alfanous Quran Search Engine
Global Schema for Alfanous Quran Search EngineAssem CHELLI
 
Alfanous Quran Search Engine API
Alfanous Quran Search Engine APIAlfanous Quran Search Engine API
Alfanous Quran Search Engine APIAssem CHELLI
 

More from Assem CHELLI (8)

How to get in GSoC , DevFest Algiers 2018
How to get in GSoC , DevFest Algiers  2018How to get in GSoC , DevFest Algiers  2018
How to get in GSoC , DevFest Algiers 2018
 
Dev environment for linux (Mainly KDE and python)
Dev environment for linux  (Mainly KDE and python)Dev environment for linux  (Mainly KDE and python)
Dev environment for linux (Mainly KDE and python)
 
Python Workshop
Python  Workshop Python  Workshop
Python Workshop
 
تجربتي مع المساهمة في المشاريع الحرة - اليوم الحر
تجربتي مع المساهمة  في المشاريع الحرة - اليوم الحر تجربتي مع المساهمة  في المشاريع الحرة - اليوم الحر
تجربتي مع المساهمة في المشاريع الحرة - اليوم الحر
 
Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending
Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending  Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending
Proposal of an Advanced Retrieval System for NobleQur'an - Thesis defending
 
Global Schema for Alfanous Quran Search Engine
Global Schema for Alfanous Quran Search EngineGlobal Schema for Alfanous Quran Search Engine
Global Schema for Alfanous Quran Search Engine
 
Alfanous Quran Search Engine API
Alfanous Quran Search Engine APIAlfanous Quran Search Engine API
Alfanous Quran Search Engine API
 
OpenSSH tricks
OpenSSH tricksOpenSSH tricks
OpenSSH tricks
 

Recently uploaded

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 

Recently uploaded (20)

The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 

Proposal of an Advanced Retrieval System for Noble Qur’an

  • 1. Advanced Search/Indexing in Holy-Quran Assem Chelli 2011 / 2012
  • 2. Ministry of Higher Education and Scientific Research National Higher School of Computer Science Thesis Of Magister Option : Mobile Destributed Computing (IRM) Proposal of an Advanced Retrieval System for Noble Qur’an Written By : Supervised By : • Assem CHELLI • Pr. Amar BALLA • Mr. Taha ZERROUKI 2011/ 2012
  • 3. ... and say : O my Lord ! have compassion on them, as they brought me up (when I was) little. – Al-isra’ 24 iii
  • 4. Acknowledgment First at all, I am thanking Allah, the Almighty for giving me strength and patience to write this modest thesis. We gratefully acknowledge Pr. Amar Balla and Mr. Taha Zerrouki for giving me the honor of their supervising during that year and guiding me with advices, and meaningful criticism. I also thank the jury members for agreeing to evaluate our modest work. A big thanks to the faculty and the administration of the National Higher School of Computer Science (ESI) who took care of my training and monitoring throughout the study program. Our deepest thanks go to the Arab open source community that has offered me a great support, especially Alfanous Team/Community that makes a valuable contribution to carry out this great work, and I hope to be worthy of the confidence they have placed on me. Finally I express my appreciation to all who contributed by their advice and their encouragement to the completion of this work, my family and my friends for their assistance and support. iv
  • 5. Abstract Noble Quran is different of all documents that we have known. It’s the sacred book of Muslims. It contains knowledge of all aspects of life. With this huge quantity of information, we can extract only a small part manually and this is considered insufficient compared to the size of knowledge contained by Quran. That raises the need for a method to extract those information because currently there is no efficient method except many printed lexicons and many tools of simple sequential search with regular expression. Due to this limitation, the Quran requires us to find new ways to interact. The goal through this work is to propose a system for advanced research in all of the information contained in the Quran by considering the morphology of the Arabic language and the properties of the Qur’anic text. It should be based on modern methods of information retrieval for good stability and high speed search. It would be very useful for researchers and could be generalized to cover all the content in Arabic. Keywords : Indexing/Search, Arabic, Holy Quran, Information retrieval, Search engines. v
  • 6. Résumé Le Coran est différent de tous les documents que nous connaissons . C’est le livre sacré des musulmans. Il comporte des connaissances sur tous les aspects de la vie. Avec un tel volume d’informations, on ne peut y extraire qu’une infime partie manuellement. Ceci s’avère être insuffisant vue la quantité de connaissances que contient le Coran. D’où la nécessité de trouver une méthode pour extraire ces informations. Or il n’existe aucun outil à utiliser sauf quelques lexiques imprimés et quelques outils de recherche simple et séquentielle par les expressions régulières. En raison de cette limitation, le Coran nous oblige à trouver de nouvelles façons d’interaction. Le but recherché à travers ce travail est de proposer un système avancé de recherche dans l’ensemble des informations contenues dans le Coran en prenant en considération la morphologie de la langue Arabe et les propriétés du texte coranique. Elle doit être fondée sur les méthodes modernes de recherche d’informations pour obtenir une bonne stabilité et une recherche de grande vitesse. Elle serait trés utile pour les chercheurs et pourrait être généralisée pour couvrir l’ensemble du contenu en arabe. Mots clés : Indexation/Recherche, Arabe, Coran, Recherche d’information, Moteurs de recherche. vi
  • 7. ‫ملخّص‬ ‫القرآن الكريم يختلف عن جميع الوثائق التي نعرفها فهو يحوي المعارف في جميع جوانب الحياة. مع‬ ‫هذا الحجم من المعلومات، لا يستطيع المرء استخراج إلا النزر اليسير يدويا وهذا ليس كافيا بالنسبة‬ ‫لحجم المعارف الواردة في القرآن الكريم. ومن هنا جاءت الحاجة إلى إيجاد طريقة لاستخراج هذه‬ ‫المعلومات. لا توجد أي وسيلة حالية فعالة باستثناء بعض المعاجم المطبوعة وبعض الأدوات التي‬ ‫تعتمد البحث البسيط التسلسلي بالعبارات النمطية. وبسبب هذا القيد، يتوجب علينا إيجاد طرق‬ ‫جديدة للتفاعل.‬ ‫الهدف من خلال هذا العمل هو اقتراح نظام متقدم للبحث في جميع المعلومات الواردة في‬ ‫القرآن الكريم، مع الأخذ بعين الاعتبار مورفولوجيا اللغة العربية وخصائص النص القرآني. وينبغي‬ ‫الاستناد إلى الأساليب الحديثة في استرجاع المعلومات من أجل تحصيل استقرار جيد وسرعة عالية‬ ‫في البحث. هذا العمل مفيد للباحثين ودارسي القرآن ويمكن أن يعمم ليشمل جميع المحتوى باللغة‬ ‫َّ‬ ‫العربية.‬ ‫كلمات مفتاحية: بحث/فهرسة، العربية، القرآن، استخراج المعلومات، محركات البحث.‬ ‫‪vii‬‬
  • 8. Contents Dedication iii Acknowledgment iv Table of Contents xv List of Figures xvii List of Tables xix List of Abbreviations xx Glossary xxiv General Introduction 1 I State Art 4 1 Search engines 5 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2.1 Keyword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2.2 Descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2.3 Document 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
  • 9. 1.2.4 Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2.5 Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Search engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Full-text search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.5 Crawling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.5.1 Crawler Features . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.5.1.1 Features a crawler must provide . . . . . . . . . . . . 9 1.5.1.2 Features a crawler should provide . . . . . . . . . . . 9 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.6.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.6.2 Indexing modes . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.6.2.1 Manual indexing . . . . . . . . . . . . . . . . . . . . . 11 1.6.2.2 Automatic indexing . . . . . . . . . . . . . . . . . . . 12 1.6.2.3 Semi-automatic indexing . . . . . . . . . . . . . . . . 13 Index types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.6.3.1 Document Index . . . . . . . . . . . . . . . . . . . . 13 1.6.3.2 Forward Index . . . . . . . . . . . . . . . . . . . . . . 14 1.6.3.3 Inverted index . . . . . . . . . . . . . . . . . . . . . . 14 1.6.3.4 N-gram index . . . . . . . . . . . . . . . . . . . . . . . 15 1.6.4 Index storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.6.5 Index update . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.6.5.1 Incremental update . . . . . . . . . . . . . . . . . . . 16 1.6.5.2 Global update . . . . . . . . . . . . . . . . . . . . . . 16 Indexing phases . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.6.6.1 Tokenization . . . . . . . . . . . . . . . . . . . . . . . 16 1.6.6.2 Normalization . . . . . . . . . . . . . . . . . . . . . . 17 1.6.6.3 Elimination of stop-words . . . . . . . . . . . . . . . . 17 1.6.6.4 Weighting . . . . . . . . . . . . . . . . . . . . . . . . . 17 Querying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 1.7.1 Relevance concept . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.7.2 Similarity Function . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.7.3 Search process . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.8 Semantic Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 1.6 1.6.3 1.6.6 1.7 2 Arabic Language 25 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.2 Orthography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 ix
  • 10. 2.3 Lexicography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.3.1 26 Verbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1.1 ‫. :)الفعل‬ Verbs with augmented root (‫:)الفعل المزيد‬ . . . . . . . 26 . . . . . . . 27 Nouns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.1.2 2.3.2 Verbs with a simple root (‫المجرد‬ ّ 2.3.2.1 ‫. . . . . . :)الأسماء‬ ّ Nouns derived from verbals (‫: )الأسماء المشتقة‬ Primitive nouns (‫الجامدة‬ . . . . . 28 . . . . 28 2.3.2.3 Numbers : . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.2.4 Demonstrative pronouns (‫الإ شارة‬ . . . . . . . . 28 . . . . . . . . . 29 . . . . . . . . 29 Function words . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Morphology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4.1 Flexional Morphology . . . . . . . . . . . . . . . . . . . . . . . 30 2.4.1.1 Flexion of verbs . . . . . . . . . . . . . . . . . . . . . 31 2.4.1.2 Flexion of nouns . . . . . . . . . . . . . . . . . . . . . 32 2.4.1.3 Flexion of function words . . . . . . . . . . . . . . . . 34 Derivational morphology . . . . . . . . . . . . . . . . . . . . . . 34 2.3.2.2 2.3.2.5 2.3.2.6 2.3.3 2.4 2.4.2 2.4.2.1 2.4.2.2 ‫:) ٔاسماء‬ ‫. . :) أسماء‬ Personal pronouns ( ‫:) الضمائر المنفصلة‬ Relative pronouns (‫موصولة‬ Deverbal noun (‫:)المصدر‬ Active participle (‫فاعل‬ . . . . . . . . . . . . . . . . ‫. . . . . . . . . . :)اسم‬ Passive participle (‫. . . . . . . . . :)اسم مفعول‬ Nouns of time and place (‫:)أسماء الزمان والمكان‬ Noun of instrument (‫. . . . . . . . . :)اسم الآلة‬ The Nomen Vicis (‫. . . . . . . . . . :)اسم المرة‬ The Nomen Speciei (‫. . . . . . . . . :)اسم الهيئة‬ 35 . . . . 35 . . . . 35 . . . . 35 . . . . 35 . . . . 36 . . . . 36 Ambiguity issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.5.1 The absence of vocalization . . . . . . . . . . . . . . . . . . . . 36 2.5.2 Prefixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.5.3 Suffixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.6 The computerization of  Arabic language . . . . . . . . . . . . . . . . . 39 2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.4.2.3 2.4.2.4 2.4.2.5 2.4.2.6 2.4.2.7 2.5 3 The Qur’an 41 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.3 Qur’an Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3.1 Fragmentation into surahs . . . . . . . . . . . . . . . . . . . . . 43 3.3.2 Fragmentation into Hizbs . . . . . . . . . . . . . . . . . . . . . 43 x
  • 11. 3.3.3 44 Qur’anic Sciences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.4.1 Knowledge of Ayahs revelation places . . . . . . . . . . . . . . 45 3.4.2 Knowledge of Ayahs revelation causes: . . . . . . . . . . . . . . 46 3.4.3 Knowledge of Morphology: 46 3.4.4 3.4 Fragmentation into Stops (Waqfs) . . . . . . . . . . . . . . . . . ‫. . . . . . : )علم مرسوم‬ Grammatical analysis of the Qur’an (‫: )إعراب ألفاظ القرآن‬ Science of allegorical ayahs (‫. . . . . . . . : )علم المتشابه‬ The beginnings of surahs (‫. . . . . . . . . . :)فواتح السور‬ ّ Knowledge of Qur’anic Parables (‫. . . . . :)الأمثال القرآنية‬ Tafssīr (‫. . . . . . . . . . . . . . . . . . . . . . :)التفسير‬ . . . . . . . . . . . . . . . . . . . . ّ Knowledge of Orthography (‫الخط‬ . . . . 47 . . . 48 . . . . 48 . . . . 48 . . . . 49 . . . . 49 Computerization of the Qur’an . . . . . . . . . . . . . . . . . . . . . . 49 3.5.1 Advantages of Computerization . . . . . . . . . . . . . . . . . . 50 Qur’an Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.6.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.6.1.1 Indexing words of the Qur’an . . . . . . . . . . . . . . 51 3.6.1.2 Ma’ājim of Qur’anic words . . . . . . . . . . . . . . . 51 3.6.1.3 Specialized Ma’ājim of Qur’anic words . . . . . . . . 52 3.6.1.4 Qur’anic Indexes and Computer . . . . . . . . . . . . 52 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.6.2.1 By unit: . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.6.2.2 By purpose . . . . . . . . . . . . . . . . . . . . . . . . 56 Projects of building indexes . . . . . . . . . . . . . . . . . . . . 58 3.6.3.1 Midād lbayān 58 3.6.3.2 Indexes by Taha Zerrouki 3.6.3.3 Qur’anic Arabic Corpus 3.4.5 3.4.6 3.4.7 3.4.8 3.4.9 3.5 3.6 3.6.2 3.6.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 Tanzil Project . . . . . . . . . . . . . . . . . . . . . . 62 3.6.3.5 Boundary-Annotated Qur’an Corpus . . . . . . . . . . 64 3.6.3.6 Qurany Concepts Tool . . . . . . . . . . . . . . . . . . 65 Qur’an Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.7.1 Qur’anic Concepts Ontology . . . . . . . . . . . . . . . . . . . 67 3.7.2 3.8 59 3.6.3.4 3.7 . . . . . . . . . . . . . The Ontology made by Hadj Henni: . . . . . . . . . . . . . . 68 Qur’anic Search Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.8.1 69 Alawfa (‫. . . . . . . . . . . . . . . . . . . . . . . . . . . )الأوفى‬ 3.8.2 Al-Monaqeb-Alqurany (‫القرآني‬ ‫)المنقب‬ . . . . . . . . . . . . . . 70 3.8.3 Quran complex search service . . . . . . . . . . . . . . . . . . . 71 3.8.4 Quranic Researcher (‫القرآني‬ 72 ‫)الباحث‬ . . . . . . . . . . . . . . . . xi
  • 12. 3.8.5 Quranologie (‫القرآن‬ . . . . . . . . . . . . . . . . . . . . . . 73 3.8.6 Quranic Corpus Word-by-Word Search . . . . . . . . . . . . . . 74 3.8.7 Tanzil (‫. . . . . . . . . . . . . . . . . . . . . . . . . . . . )تنزيل‬ 75 Zekr (‫. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . )ذكر‬ 76 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.8.8 3.9 II ‫)علم‬ Analysis & Conception 78 4 Classification & Proposition of Qur’anic Search Features 79 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.2 Difficulties of Search in Quran . . . . . . . . . . . . . . . . . . . . . . 79 4.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.4 Proposals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.4.1 Advanced Query . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.4.2 Output Improvements . . . . . . . . . . . . . . . . . . . . . . . 84 4.4.3 Suggestion Systems . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.4.4 Linguistic Aspects . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.4.5 Qur’anic Options . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.4.6 Semantic Queries . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.4.7 Statistical System . . . . . . . . . . . . . . . . . . . . . . . . . 96 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 4.5.1 4.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.5.1.1 Survey Participants Details . . . . . . . . . . . . . . . 97 4.5.1.2 4.6 Survey Results of survey . . . . . . . . . . . . . . . . . . . . . 99 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5 Conception 102 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.2.1 Text Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.2.2 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.2.3 Suggestions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.2.4 Results Processing . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.2.5 Indexes Importing . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.3 Full vocalized search engine . . . . . . . . . . . . . . . . . . . . . . . . 107 5.4 Othmani script and text processing . . . . . . . . . . . . . . . . . . . . 111 5.4.1 Substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.4.1.1 Romanizations . . . . . . . . . . . . . . . . . . . . . . 113 xii
  • 13. 5.4.1.2 Numbers into words: . . . . . . . . . . . . . . . . . . 114 5.4.2 5.4.3 Normalization: . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.4.4 Filtering stop-words: . . . . . . . . . . . . . . . . . . . . . . . . 120 5.4.5 5.5 Tokenization: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 Lemmatization: . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Qur’anic Word Search . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 5.5.1 5.5.2 Semantically Related Words . . . . . . . . . . . . . . . . . . . . 125 5.5.3 Multi-level Derivations . . . . . . . . . . . . . . . . . . . . . . . 126 5.5.4 Specific Derivations . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.5.5 5.6 Word properties search . . . . . . . . . . . . . . . . . . . . . . . 124 Fuzzy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 III Implementation 6 Implementation 130 131 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.2 Why Open Source? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.2.1 License : AGPL . . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.2.2 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.2.3 Whoosh Search API . . . . . . . . . . . . . . . . . . . . . . . . 134 6.3 Previous Code Base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 6.4 Our improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.4.1 A New Centralized JSON Output System: . . . . . . . . . . . . 138 6.4.2 Many new features . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.4.3 Resource Importing Manager . . . . . . . . . . . . . . . . . . . 142 6.4.4 Automating the API building . . . . . . . . . . . . . . . . . . . 143 6.4.5 A new console interface . . . . . . . . . . . . . . . . . . . . . . 143 6.4.6 Enhancing the web interface . . . . . . . . . . . . . . . . . . . . 143 6.4.7 Packaging system: . . . . . . . . . . . . . . . . . . . . . . . . . 144 6.4.8 Multiple search units . . . . . . . . . . . . . . . . . . . . . . . . 144 6.4.9 Coding Standardization . . . . . . . . . . . . . . . . . . . . . . 145 6.4.10 Documentation covering . . . . . . . . . . . . . . . . . . . . . . 146 6.4.11 Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 6.5 Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 6.5.1 Application Programming Interface . . . . . . . . . . . . . . . . 150 6.5.1.1 JSON web service . . . . . . . . . . . . . . . . . . . . 152 xiii
  • 14. 6.5.1.2 6.6 Console interface . . . . . . . . . . . . . . . . . . . . 153 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 General Conclusion 156 Bibliography 158 Appendices A1 Annex A: Paper Abstracts A2 xiv
  • 15. List of Figures 1.1 The various components of a web search engine . . . . . . . . . . . . . 9 1.2 Indexing Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.3 Search process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.4 Search results returned as a web page: one of the possible ways to expose results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.1 Ligature of lâm and álif. . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.1 Explanatory diagram of fragmentation into Surahs . . . . . . . . . . . 43 3.2 Explanatory diagram of fragmentation into Hizbs . . . . . . . . . . . . 44 3.3 Index using the word as a unit [Arabic Quranic Corpus] . . . . . . . . 53 3.4 Index that takes the ayah as a unit [Arabeyes Quran Model] . . . . . . 54 3.5 Classification indexes by purpose . . . . . . . . . . . . . . . . . . . . . 56 3.6 The various structures of Qur’an . . . . . . . . . . . . . . . . . . . . . 57 3.7 Preview of Qurany . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.8 A closer look at Qur’an Concepts Ontology 68 3.9 Diagram of domain ontology of Qur’anic documents made by Hadj . . . . . . . . . . . . . . . Henni[Hadjhenni2008] . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.10 Preview of Alawfa website: www.alawfa.com . . . . . . . . . . . . . . . 70 3.11 Preview of Al Monaqeb Alqurany . . . . . . . . . . . . . . . . . . . . . 71 3.12 Preview of Quran Complex Search page . . . . . . . . . . . . . . . . . 72 3.13 Preview of Quranic Researcher (www.quranicresearcher.com) . . . . . . 73 3.14 Preview of Quranologie (quranologie.com) . . . . . . . . . . . . . . . . 74 3.15 Preview of Qur’anic Arabic Corpus Word By Word search . . . . . . . 75 3.16 Preview of Tanzil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.17 Preview of Zekr Application . . . . . . . . . . . . . . . . . . . . . . . . 77 4.1 84 Pages view in Google.com . . . . . . . . . . . . . . . . . . . . . . . . . xv
  • 16. 4.2 Highlight the keyword ‫( وحيدا‬alone) in Ayah 11 of al-modather – al- fanous.org . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.3 Ayah in full diacritical marks - quran.com . . . . . . . . . . . . . . . . 85 4.4 Query spell correction in Google.com . . . . . . . . . . . . . . . . . . . 85 4.5 Focusing on ”Yaqub” in Ontology of concepts of corpus.quran.com . . 86 4.6 Related searches suggestion in Google.com . . . . . . . . . . . . . . . . 87 4.7 Keyboard mapping Arabic to English in Google.com . . . . . . . . . . 87 4.8 Different romanizations for the word . . . . . . . . . . . . . . . . 88 4.9 Different transliterations used in ElixirFM Resolve Online . . . . . . . 88 4.10 Syntactic Coloration of Basmalah – bayt-al-hikma.com . . . . . . . . . 88 4.11 Google Voice Search on Android . . . . . . . . . . . . . . . . . . . . . . 89 4.12 Annotations shown by Quranic Arabic Corpus website . . . . . . . . . 90 4.13 Divine names Highlight in Quran Reader iPhone application . . . . . . 91 4.14 Faceted Thematic Browsing - Qurany Project . . . . . . . . . . . . . . 93 4.15 Audience Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.16 Audience experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 ‫خليفَة‬ َ 4.17 Clarity, Usefulness, and Need percentage of each feature . . . . . . . . 100 5.1 Basic Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.2 The behavior of Searcher . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.3 Text processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.4 Results processing phases . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.5 Different possible declensions of the word 5.6 Different types of the possible vocalizations of the word 5.7 Different writing forms of the word 5.8 Merging Words in Uthmani Script . . . . . . . . . . . . . . . . . . . . . 112 5.9 General Schema of Uthmani and Standard text processing. . . . . . . . 113 ‫الملك‬ . . . . . . . . . . . . . 109 ‫من‬ . . . . . . . 110 ‫ بسطة‬using Othmani script . . . . . 111 5.10 Example of Substitution . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.11 Tokenization of the word ‫فأَسق ْي َنٰكمو ُه‬ ُُ َْ َ . . . . . . . . . . . . . . . . . . . . 116 5.12 Sub-tokens separation schema . . . . . . . . . . . . . . . . . . . . . . . 118 5.13 Example of tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . 119 5.14 Arabic case markers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.15 Example of normalization . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.16 Example of stop-word filtering 5.17 Examples of lemmatization . . . . . . . . . . . . . . . . . . . . . . 121 . . . . . . . . . . . . . . . . . . . . . . . . 122 5.18 Two-Steps search behavior . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.19 Semantically related words : Idols in Quran . . . . . . . . . . . . . . . 124 5.20 Word properties search example : First person, Plural, Masculine . . . 125 xvi
  • 17. 5.21 Searching through an ontology . . . . . . . . . . . . . . . . . . . . . . . 125 5.22 Semantically Related Words, Hyponymy of the word ‫( نبي‬prophet) . . . 126 5.23 Multi-level Derivation Search example . . . . . . . . . . . . . . . . . . 127 5.24 Special derivations example, Imperative of ‫( قال‬to say) . . . . . . . . . 128 6.1 Screenshot of the Qt desktop interface . . . . . . . . . . . . . . . . . . 136 6.2 Json results of 6.3 Fuzzy search example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.4 Showing adjacent ayahs . . . . . . . . . . . . . . . . . . . . . . . . . . 141 6.5 Showing ayahs in different scripts . . . . . . . . . . . . . . . . . . . . . 141 6.6 Suggestion example of Vocalizations , Derivations ,and Synonyms of 6.7 Annotations of the keyword 6.8 Buckwalter translation example . . . . . . . . . . . . . . . . . . . . . . 142 6.9 Fields table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 ‫الكوثر‬ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 ‫. قيمة‬ ‫قول‬ 141 . . . . . . . . . . . . . . . . . . . . . 141 6.10 Translation-as-unit search , Query: seven . . . . . . . . . . . . . . . . 145 6.11 Word-as-unit json outout, Query: ‫قيمة‬ 6.12 Interfaces dependency hierarchy . . . . . . . . . . . . . . . . . . . . . 150 . . . . . . . . . . . . . . . . . . 145 6.13 API usage sample code . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 6.14 Preview of the JSON web service . . . . . . . . . . . . . . . . . . . . . 153 6.15 Preview of the Console interface . . . . . . . . . . . . . . . . . . . . . . 153 xvii
  • 18. List of Tables 1.1 Document Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.2 Forward Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.3 Inverted Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.4 N-gram index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.1 3 types of Arabic letters: 1 form, 2 forms or 4 forms . . . . . . . . . . . 26 2.2 The change of meaning by changing the diacritical marks . . . . . . . . 36 2.3 The change of function by changing diacritical marks . . . . . . . . . . 37 2.4 Ambiguities of prefixes . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.5 Ambiguities due to suffixes . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.1 Waqfs types [Web-Islamweb] . . . . . . . . . . . . . . . . . . . . . . . . 44 3.2 Some numerical miracles of Qur’an[Nawfal1975] . . . . . . . . . . . . . 50 3.3 Index based on word parts . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.4 Index based on sentences– surah: al-fātiha . . . . . . . . . . . . . . . . 56 3.5 Overview of main index– Midād lbayān . . . . . . . . . . . . . . . . . . 58 3.6 Overview of words index – M.Taha Zerrouki . . . . . . . . . . . . . . . 59 3.7 Overview of the topics index– Taha Zerrouki . . . . . . . . . . . . . . . 60 3.8 Overview of the index of synonyms – Taha Zerrouki . . . . . . . . . . . 60 3.9 Overview of morphology index – Quranic Arabic corpus . . . . . . . . . 62 3.10 Example of simple proper Quranic text – Tanzil.info . . . . . . . . . . 63 3.11 Example of Surah index – Tanzil.info . . . . . . . . . . . . . . . . . . . 63 3.12 Sajdah index – Tanzil.info . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.13 Example of rub’ index – Tanzil.info . . . . . . . . . . . . . . . . . . . . 64 3.14 Sample of Boundary Annotated Qur’an Corpus . . . . . . . . . . . . . 65 5.1 Partial vocalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.2 Some numbers as they mentioned in Quran . . . . . . . . . . . . . . . 115 xviii
  • 19. 6.1 Search request flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 6.2 Pylint Analysis stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 6.3 Implementation State of search features . . . . . . . . . . . . . . . . . 148 xix
  • 20. List of Abbreviations AGPL Affero General Public License. API Application Programming Interface. GPL GNU Public License. GUI Graphical User Interface. IDF Inverse Document Frequency. OWL Web Ontology Language. PC Personal Computer. POS Part Of Speech. POS Part Of Speech. RSV Retrieval Status Value. TF Term Frequency. TF*TDF term frequency - inverse document frequency. UI User Interface . xx
  • 21. Glossary Abrogated ayahs Abrogating ayahs Accusative Active Participle Active voice Allegorical ayah Assimilated verb Attaching pronoun Ayah Basmalah Book Science Broken plural Conjugation Declension Declinable Defect Demonstrative pronoun Deverbal noun Diacritical marks Diphthong Diptote Dual form Expansion ‫الآيات_المنسوخة‬ ‫الآيات_الناسخة‬ ‫حالة_النصب‬ ‫اسم_الفاعل‬ ‫صيغة_مبني_للمعلوم‬ ‫الآيات_المتشابهات‬ ِ ‫فعل_م َثال‬ ‫ضمير_م َّتصل‬ ‫آية‬ ‫بسملة‬ ‫علم_الكتاب‬ ‫جمع_تكسير‬ ‫تصريف_الأفعال‬ ‫الإ عراب‬ ‫معرب‬ ‫علّة‬ ‫اسم_إشارة‬ ‫مصدر‬ ‫علامات_التشكيل‬ ‫الإ دغام‬ ‫ممنوع_من_الصرف‬ ‫المثنّى‬ ‫الزيادة‬ ّ xxi
  • 22. External feminine plural External masculine plural External plural Fiqh First ayahs of surah First person Fusional language Geminated verb General and Particular Genitive Hamzah Hamzated verb Healthy verb Hizb Hollow verb Imperative Imperfective Instrument noun Internal plural Inversion Jussive Juz’ Lam-ALef Last ayahs of surah Laws Lemma Lexicology Makkan Medinan ‫جمع_مؤنث_السالم‬ ‫جمع_مذكر_السالم‬ ‫جمع_سالم‬ ‫الفقه‬ ‫فواتح_السورة‬ ‫المتكلم‬ ‫لغة_إشتقاقية‬ ‫فعل_مضاعف‬ َ َ ُ ّ‫الخاص_والعام‬ ّ ‫حالة_الجر‬ ‫همزة‬ ‫فعل_مهموز‬ َْ ‫فعل_صحيح‬ ِ ‫ح ْزب‬ ‫فعل_أَج َوف‬ ْ ‫الأمر‬ ‫المضارع‬ ‫اسم_الآلة‬ ‫جمع_تكسير‬ ‫القلب‬ ‫حالة_الجزم‬ ‫ج ْزء‬ ُ ‫لام_ألف‬ ‫خواتيم_السورة‬ ‫الأحكام‬ ‫جذع_الكلمة‬ ‫علم_المعاجم‬ ‫مكّ ّية‬ ‫مدن ّية‬ xxii
  • 23. Migration of Prophet Morphology Mus’haf Narration of Hadiths Nisf Nomen Speciei Nomen Vicis Nominative Noun of place Noun of time Object Orthography Othmani script Passive Participle Passive voice People of the book Perfective Personal pronoun Plural form Primitive noun Prophet’s Sunnah Prosody Prostration of recitation Qiblah Qur’anic comma Qur’anic Parable Recitation Relative pronoun Revelation Science Rewayate Rewayate of Kaloun ‫الهجرة_النبوية‬ ‫الصرف‬ ‫المصحف‬ ُ ‫رواية_الأحاديث‬ ‫نِصف‬ ْ ‫اسم_الهيئة‬ ‫اسم_المرة‬ َّ ‫حالة_الرفع‬ ‫اسم_مكان‬ ‫اسم_زمان‬ ‫مفعول‬ ّ ‫علم_مرسوم_الخط‬ ّ ‫الخط_العثماني‬ ‫اسم_المفعول‬ ‫صيغة_مبني_للمجهول‬ ‫أهل_الكتاب‬ ‫الماضي‬ ‫ضمير_منفصل‬ ‫الجمع‬ ‫اسم_جامد‬ ‫الس َّنة_النبوية‬ ُ ‫تقطيع_العروض‬ ‫سجود_التّلاوة‬ ِ ‫القبلة‬ ‫فاصلة_قرآنية‬ ‫م َثل_قرآني‬ َ ‫التلاوة‬ ‫اسم_موصول‬ ‫علم_التنزيل‬ ‫رواية‬ ‫رِواية_قالون‬ xxiii
  • 24. Rhetoric Root Rubu’ Sajdah Second person Singular form Standard script Subject Superlative noun Surah Surah keys Tafssir The Five Nouns Third person Thumn Translation of Qur’an Triptote Verb with a simple root Verb with augmented root Virtues of surah Waqf Weakened verb ‫البلاغة‬ ‫جذر_الكلمة‬ ‫ر ُبع‬ ُ ‫سجدة‬ ‫المخاطب‬ ‫المفرد‬ ‫الخط_الإ ملائي‬ ‫فاعل‬ ‫اسم_تفضيل‬ ‫سورة‬ ‫مفاتيح_السورة‬ ّ ‫علم_تفسير_القرآن‬ ‫الأسماء_الخمسة‬ ‫الغائب‬ ‫ثُمن‬ ُ ‫ترجمة_معاني_القرآن‬ ‫منصرف‬ ‫فعل_مجرد‬ ّ ‫فعل_مزيد‬ ‫فضائل_السورة‬ ‫َوقف‬ ِ ‫فعل_نَاقص‬ xxiv
  • 25. General Introduction Work Context Qur’an, in Arabic, means the read or the recitation. Muslim scholars define it as: the words of Allah revealed to His Prophet Muhammad, written in Mus’haf and transmitted by successive generations (‫[)التواتر‬Mahssin1973]. The Qur’an is also known by other names such as: Al-Furkān , Al-kitāb , Al-dhikr , Al-wahy and Al-rōuh . It is the sacred book of all Muslims and the first reference to Islamic law. It’s more then 14 centuries passed since its revelation, and the Muslims are still studying it, teaching it, writing books about it and recently developing applications for it. Qur’an is an important source of information that contains various information about all aspects of life: Scientific, Social, Historic, Politic...etc. Problematic Due to the large amount of information held in the Qur’an, it has become extremely difficult for regular search engines to successfully extract key information. For example, When searching for a book related to English grammar, you’ll simply Google it, select a PDF file and download it. That’s all! Search engines (like Google) are used generally on Latin letters and for searching general information of document like content, title, author…etc. However, searching through Qu’ranic text is a much more complicated; It’s procedure that’s requiring a much more in depth solution as there is a lot of information that needs to be extracted to fulfill Qur’an scholar’s needs. Before the creation of computer, Qur’an scholars were using printed lexicons made manually. The printed lexicons can’t help much since many search process waste the time and the force of the searcher. Each lexicon is written to reply to a specific query which is generally simple. Nowadays, there are many applications that are specific for search needs; most of applications that were developed for Qur’an had the search feature but 1
  • 26. General Introduction in a simply way: sequential search with regular expressions. The simple search using exact query does not offer better options and still inefficient to move toward Thematic search by example. Full text search is the new approach of search that replaced the sequential search and which is used in search engines. Unfortunately, this approach is not applied yet on Qur’an. The question is why we need this approach? Why search engines? Do applications of search in Qur’an really need to be implemented as search engines? Objectives Our proposal is about design a retrieval system that fit the Qur’an search needs. But to realize this objective, we must first list and classify all the search features that are possible and helpful. Then we need to study how to implement each feature and what is its requirements. Report organization We organized the report as follows: First Part : Art State This part contains 3 chapters: Chapter 1 : Search Engines To design a powerful search engine, it is essential to understand how search engines work, in this chapter we discuss the different parts of a search engine, namely: the crawling, indexing and querying . And the definition of basic concepts in the field of information retrieval systems. This chapter contains an introduction to the semantic approach. Chapter 2 : Arabic Language The objective of this chapter is to present the properties of the Arabic language, its spells, its morphology and to introduce some ambiguity issues that raise due to the Arabic nature ... etc. Chapter 3 : The Qur’an This chapter presents an overview of the Qur’an and its sciences, it has a historical background on the evolution of the Qur’an, the structure of the Mus’haf, and the main problems of computerization of the Qur’an, including the script Uthmani and authentication Qu’ranic texts. 2
  • 27. General Introduction Second Part : Analysis & Conception This part contains two chapters : Chapter 4 : Qur’anic search features The objective of this chapter is to present the possible search features in Qur’an. It has a big importance in our work since it defines our objectives and our path on the work. We’ve make a survey about Usefulness, Need, and Clarity of each feature in order to validate our points of view in choosing those features. Chapter 5 : Conception In this chapter, we start by a preview on our previous work then we’ll propose many improvements to carry out all the feasible search features mentioned in the previous chapter. Third Part : Implementation This part contains the different steps of implementation of our retrieval system. It includes one chapter: Chapter 6 : Implementation This chapter describes the choice of technologies and development tools and also presents the prototype with a description of various features. Finally, we finish the report with a conclusion that summarizes our work. We include an appendix that describes the papers published about this work. Actually there are two papers: • An Arabic paper in NITS 2011 KSA entitled ”An Application Programming Interface for indexing and search in Noble Quran”1 [Chelli2011]. • An English paper in a pre-conference workshop in LREC 2012 Turkey which is about ”LRE-Rel: Language Resource and Evaluation for Religious Texts”. The paper was entitled ”Advanced Search in Quran: Classification and Proposition of All Possible Features”[Chelli2012]. 1 Arabic title: ‫مكتبة برمجية للفهرسة والبحث في القرآن الكريم‬ 3
  • 29. Chapter 1 Search engines How could the world beat a path to your door when the path was uncharted, uncatalogued, and could be discovered only serendipitously? Paul Gilster, Digital Literacy 1.1 Introduction Our work falls within the field of Information Retrieval, as it aims to design a search engine, in this chapter we will discuss how search engines work by explaining its main components. Exploration is the part that feeds the search engine by documents that it collects, but with the amount of information that becomes larger and larger, it is necessary to develop methods of search, only indexing able to accelerate search in very large systems such as the Web, because it anticipates the search by extracting and arranging them keywords. So that search results be satisfactory, we must properly calculate the relevance of results against the query, this is done during the interrogation. The question must also be able to express simple questions as well as complex questions. The quality of research is directly related to the quality of the crawling, indexing and search, these three operations can be considered as the core of search engine, the objective of this chapter is to define the main concepts of this area, starting with defining the crawling, then study indexing, its methods and steps, and then we shall explain the process of search and the notion of relevance. 5
  • 30. Chapter 1 1.2 Definitions 1.2.1 Keyword Word or set of words chosen to represent the contents of a document, and find it in document search. It can be coming from the document (title, text, abstract, ...) or a controlled vocabulary..[Hensens1998] 1.2.2 Descriptor Keyword selected from a set of equivalent terms to represent clearly a concept. It is usually part of an organized and hierarchical vocabulary of type ”thesaurus”.[Hensens1998] 1.2.3 Document A document can be text, a piece of text, web page, image, video, etc. We call Document any unit that can be an answer to a user query. For textual documents, there are many forms regarding their specification. A document can be a text without any structure (it is also called full-text) and may also be a text with a structured part (document partially structured or semi structured) or fully structured. [Amrouche2008] 1.2.4 Query A query expresses the need for information of a user. Various types of query languages have been proposed to formulate a query. A query can be expressed: • In natural (or almost) language (eg: ”find all the manufacturing facilities of cars and their addresses”)[Salton1971] • In a structured format, also called Boolean query language (eg: ”cars and factories and brand”)[Bourne1979] • As graphical language from a GUI [Lelu1992] 1.2.5 Relevance Relevance is a word that simply means returning the information considered the most useful at the top of a result list. While the definition is simple, getting a program to compute relevance is not a trivial task, mainly because the notion of usefulness is hard for a machine to understand. [Bernard2009] 6
  • 31. Chapter 1 1.3 Search engines A search engine is software that allows to regain resources (web pages, images, video, files, information ... etc.) related to any words. Some websites offer a search engine as the main feature, called then search engine the website itself (Google, Yahoo, Bing ... are search engines). [Nejjari2007] A search engine is also a crawling tool on the web made up of ”robots” that explore the websites periodically and automatically (without human intervention, that is what distinguishes search engines from directories). They follow the links (pages that link to each other) encountered on each page reached. Each identified page will be indexed in a database, then will be accessible by Internet users using keywords.[Sanan2008] Search engines do not apply only to Web: some engines are softwares installed on personal computers. These are known as desktop search engines , they aims the search in the files stored on the PC - include such Exalead Desktop, Google Desktop and Copernic Desktop Search ... etc. In December 2004, Tim Berners Lee (the inventor of the World Wide Web) talked about a new project: ”Semantic Web” which is based on processing of the web information automatically according to their significances. The reason for this was that 80% of Web contains texts intended to be read and understood by humans. While computer programs, Web browsers and search engines are unable to understand this content, so they are unable to speed up the search. In less than two years from the article of Lee, the First foundations of Semantic Web were formed. They seemed to lead the world toward a new revolution in Internet and search engines .[Abulhajjaj2009] At first glance, nothing distinguishes a classic search engine from a semantic one. The same sparse interface, with a text box in the center of the page where the user can enter his search query. In fact, the difference lies in the search mode. A classic search engine , as Google, works as follows: its robots index browse the pages and index the words. Then store these words in a gigantic database. Users can do search by send their queries and a search algorithm retrieve the results and sort them in a certain order based on their relevance.[Mentre2008] 1.4 Full-text search Full-text search is a technology focused on finding documents matching a set of words . While sounding like a mouthful, full-text search is more common than you might think. You probably have been using full-text search today. Most of the web search 7
  • 32. Chapter 1 engines such as Google and Yahoo! use full-text search engines at the heart of their service. The differences between each of them are recipe secrets (and sometimes not so secret), such as the Google PageRankTM algorithm. PageRankTM will modify the importance of a given web page (result) depending on how many web pages are pointing to it and how important each page is . Be careful, though; these so-called web search engines are way more than the core of full-text search: They have a web UI , they crawl the web to find new pages or existing ones, and so on. They provide business-specific wrapping around the core of a full- text search engine. Given a set of words (the query), the main goal of full-text search is to provide access to all the documents matching those words. Because sequentially scanning all the documents to find the matching words is very inefficient, a full-text search engine (its core) is split into two main operations: indexing the information into an efficient format and searching the relevant information from this precomputed index. From the definition, you can clearly see that the notion of word is at the heart of full-text search; this is the atomic piece of information that the engine will manipulate. [Bernard2009] 1.5 Crawling Crawling is the process by which we gather pages from the Web, in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. the web crawler is sometimes referred to as a spider . [Manning2009] This process is in the phase preceding the indexing phase, see Figure: 8
  • 33. Chapter 1 Figure 1.1: The various components of a web search engine 1.5.1 Crawler Features We list the desiderata for web crawlers in two categories: features that web crawlers must provide, followed by features they should provide. [Manning2009] 1.5.1.1 Features a crawler must provide Robustness : The Web contains servers that create spider traps, which are gener- ators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development . Politeness : Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected . 1.5.1.2 Features a crawler should provide Distributed : The crawler should have the ability to execute in a distributed fashion across multiple machines . Scalable : The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth . 9
  • 34. Chapter 1 Performance and efficiency : The crawl system should make efficient use of various system resources including processor, storage and network band- width. Quality : Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching “useful” pages first. Freshness : In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine’s index contains a fairly current representation of each indexed web page. For such continuous crawling, a crawler should be able to crawl a page with a frequency that approximates the rate of change of that page. Extensible : Crawlers should be designed to be extensible in many ways – to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular. 1.6 Indexing To make the research cost acceptable, it should pass  by an essential phase in the document database. This phase consists in analyzing each document in the collection to create a set of keywords: we call it the indexing phase. These keywords will be more easily used by the system during the subsequent process of search. Indexing create a representation of documents in the system. Its objective is to find the most important concepts of the document (or query), which form the descriptor of document.[Sauvagnat2005] 1.6.1 Definition Indexing is the act of describing or classifying a document by index terms or other symbols in order to indicate what the document is about, to summarize its content or to increase its find-ability. In other words, it is about identifying and describing the subject of documents. Indexes are constructed, separately, on three distinct levels: terms in a document such as a book; objects in a collection such as a library; and documents (such as books and articles) within a field of knowledge. The process of indexing begins with any analysis of the subject of the document. The indexer must then identify terms which appropriately identify the subject either 10
  • 35. Chapter 1 by extracting words directly from the document or assigning words from a controlled vocabulary. The terms in the index are then presented in a systematic order. Indexers must decide how many terms to include and how specific the terms should be. Together this gives a depth of indexing.[Lancaster2003] Indexing is most often used to information retrieval. But it can also be used in other areas such as automatic classification of documents,   keyword suggestion, co-occurring terms calculating, automatic summarization, etc.[Abar2009] Figure 1.2: Indexing Benefits 1.6.2 Indexing modes 1.6.2.1 Manual indexing Manual indexing is achieved by a human expert (librarian or specialist in the field ) that analyzes the content of the text to identify the terms representing the document. Manual indexing ensures greater relevance in the answers, because it identifies a more specific keywords describing a document. However, it has several drawbacks, there is the problem of used vocabulary and the dependence on indexer’s knowledge on the topic, ie the same document can be indexed in several ways (according to vision of the person who makes the indexing), and an indexer at two different times can have two distinct terms to represent the same concept. The major drawback of this method is the cost in time, this method is not therefore appropriate when the number of documents to be indexed is substantial. [Sauvagnat2005, Abar2009, Amrouche2008] 11
  • 36. Chapter 1 Manual indexing is based on four key points [Chartron1989] : • reading the entire document for preparation ; • consideration of descriptors, objectives (applications) and user needs; • permanent complementarity between the terms of manual indexing and abstract; • in the absence of appropriate descriptor, and when the emergence of a new concept is not explicit enough to propose a candidate descriptor, the ability to use a close or generic descriptor. So we thought fast enough to use the computer[Mustafa]. 1.6.2.2 Automatic indexing Automatic indexing is a set of automated processing phases applied on documents. We distinguish: Tokenization (automatic extraction of word), Elimination of stop words, Stemming (Lemmatization or radicalization), Scoring of words and finally the creation of the index[Sauvagnat2005]. The first approach to the automatic indexation KWIC (Key Word In Context-) was introduced by Luhn (1957)[Luhn1957]. There was discussion about to weight the index. In the early days of information retrieval, statistical methods were based on the frequency of words in the document. Later, this measure was extended to take into account the specificity of a term for the document. To this end, other methods have been exploited, such as 2-Poisson (Nie, 2003)[Gaussier2003][Mustafa]. The automatic indexing systems use several methods of analysis: 1.6.2.2.1 Linguistic analysis: Technology issued from ”text mining”, the latter is to implement a simplified model of linguistic theories in computer systems of learning. This is part of the artificial intelligence field . [Allab2008] The linguistic method consists of several modules of linguistic analysis: morphological, lexical, syntactic and pragmatic. The fact that some systems use indexing techniques of natural language processing, demonstrates the relevance of a linguistic approach. [Elhachani1997] 1.6.2.2.2 Statistical analysis: The initiator of the methods of the automatic indexation is H.P. Luhn with his influential article “The automatic creation of literature abstracts” published in 1958 in the “Journal of Research and Development” of IBM. He states : « (...) instead of sampling at ran- dom, as a reader normally does when scanning, the new mechanical method selects those among all the sentences of 12
  • 37. Chapter 1 an article that are the most representative of pertinent information », H. P. Luhn opened the door to work on automatic indexing by proximity also called statistical method[Luhn1958]. Automatic indexing involves the following steps : • Extracting words (tokenization): the extraction rules are language-dependent. • Eliminating stop words (stop words): these are words too frequent but unnecessary. Example: the, a, of, or ... etc. • Stemming : for example the stem of the word ”stemmers” is ”stem”. • Transformation rules: removal of plural endings. • Truncation: choose an optimal value of truncation of words. It is better to truncate suffixes. There is no absolute rule for this. 1.6.2.3 Semi-automatic indexing The two previous techniques can be combined, a first automatic process to extract the terms of the document. However the final choice remains the expert in the field or librarian to establish semantic relations between keywords and choose the significant terms using a thesaurus or a terminology database which is an organized list of descriptors (keywords) obeying specific terminology rules[Abar2009, Sauvagnat2005, Hadjhenni2008]. 1.6.3 Index types The index is the output of the indexing process, there are several types of indexes according to the used technique and the desired function: 1.6.3.1 Document Index The document index keeps information about each document. It is an index ISAM (Index sequential access mode) with a fixed width, ordered by the ID of the document. The information stored in each entry includes data, a checksum of documents and various statistics. If the document was crawled, it also contains a pointer to a variable width file called the document information that contains the URL and title. This design decision was driven by the desire to have a relatively compact data structure, and the ability to find a record in one disk traversal, when queried.[Brin1998] The following table is a simplified illustration of a document index: 13
  • 38. Chapter 1 Document ID Document 1 Document 2 Document 3 Text The cow says moo The cat and the hat The dish ran away with the spoon Link /ex/doc1.txt /ex/doc2.txt /ex/doc3.txt Table 1.1: Document Index 1.6.3.2 Forward Index The forward index stores a list of words for each document. The following is a simplified form of forward index: Document ID Document 1 Document 2 Document 3 Words the, cow, says, moo the, cat, and, the, hat the, dish, ran, away, with, the, spoon Table 1.2: Forward Index The rationale behind developing a forward index is that as documents are parsing, it is better to immediately store the words per document. The delineation enables Asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index. [Brin1998] 1.6.3.3 Inverted index Many search engines include an inverted index when evaluating a search query to quickly retrieve documents that contain words in the query and then sort them by relevance. Since the inverted index stores the list of documents containing each word, the search engine can use direct access to find documents associated with each word in a query to retrieve documents that respond quickly. The following table is a simplified illustration of an inverted index: 14
  • 39. Chapter 1 Word the cow says moo Documents Document 1, Document 3, Document 4, Document 5 Document 2, Document 3, Document 4 Document 5 Document 7 Table 1.3: Inverted Index This index can identify only if a word exists in a particular document because it does not store any information regarding the frequency or the position of the word. It is considered an index of boolean. This index determines which documents that match a query, but does not classify them. In some models, the index includes additional information such as frequency of each word in each document or positions of a word in each document. The position information allow the search algorithm to identify the adjacent words to support the search by phrases. The frequency can be used to assist calculating the relevance of documents to the query. [Grossman2002, Tang2004] 1.6.3.4 N-gram index An n-gram is a sequence of n consecutive characters. For any document, all n-grams (usually n takes the values 2 or 3) we can generate, is the result obtained by shifting a window of n squares on the body text. This shift occurs in steps, one step corresponds to a character. Then we calculate the frequencies of n-grams found. for example 1 [Jalam2002]the french sentence ”La nourrice nourrit le nourrisson” is represented by : n-grams Frequencies 1 la_ 1 2 a_n 1 3 _no 3 4 nou 3 5 our 3 6 urr 3 7 rri 3 8 ric 1 9 ice 1 10 _ce 1 11 e_n 2 12 rit 1 … … … Table 1.4: N-gram index One benefit of n-grams is automatic tracking of the most common stems [Grefenstette1995]: dans in the previous example, using techniques based on n-grams we find the common root of : Nourrir, nourri, nourrit, nourrissez, nourriture, etc. Tolerance to spelling mistakes and distortions is also an important property. [Sanan2008] 1.6.4 Index storage Storage of index structures is mainly characterized per the index size and organization of its elements. Index structures vary widely in their use of size that is closely related 1 the character ”_” is used instead of spaces, in order to facilitate the reading. 15
  • 40. Chapter 1 to the organization of data in the index. This organization has a significant impact on latency of search. More items are closely related to each other in the storage space is less latency research, this is called the concept of locality. It is also very important that the index can hold in main memory, it avoids disk access to the system and reduces the latency of search. The ideal index is one that occupies less space and minimize search latency. [Dahak2006] 1.6.5 Index update Updating the index refers to the behavior of applying changes on the index . Changes can be insertions, modifications or deletions. An index can be more or less able to adapt to these changes. This adaptation can occur in two forms: 1.6.5.1 Incremental update In the case of an incremental update, the structure of the index is updated by adding back the indexes of new documents without modifying existing ones. The number of changes in this case is, however, often limited.[Dahak2006] 1.6.5.2 Global update The third case, and worst is when the structure of the entire index must be rebuilt from scratch.[Dahak2006] 1.6.6 Indexing phases The indexing process consists of the following phases: 1.6.6.1 Tokenization Tokenization is a phase that may seem trivial at first, and yet provide the basis for the rest of the indexing phases. Therefore this phase must be done with the highest quality.[Meylan2001] Some retrieval systems use a list of predefined keywords. This list is designed manually and, in most cases built for a specific topic. This method allow to control the index size. The use of automatic extraction of keywords or the use of a list of predefined keywords, determines the type of indexing. Document-oriented in the first case and query-oriented in the second.[Berrut1997, Dahak2006] 16
  • 41. Chapter 1 1.6.6.2 Normalization This processing is to find for a word its normalized form (usually the masculine for nouns, infinitive for verbs, the masculine singular for adjectives, etc.). Thus, in the index are stored only in their normalized forms, which offers a significant size saving, but more importantly, even if the processing is done on the request, it can be much quicker and more flexible in research: for example, if a user searches with a verb, documents that contains this verb in all its conjugated forms will be considered, not just documents containing the word in the form provided by the user. This step is also called ”morphological processing of keywords”[Denoyer2004] This phase can also be enriched with syntactic and semantic processing of keywords. The first is to identify and group a set of words whose meaning depends on their union. For example, the words ”White House” does not usually mean you’re dealing with a house that is white, but instead the seat of the presidency of the United States. It is also to remove ambiguities such as the problems of homography. Semantic processing is intended to make distinctions between different possible meanings of a word (polysemy). For example, this phase helps differentiate the word ”room” that can match a coin, or a room in a house. This is an arduous task that is not currently well controlled and its effect on system performance is not always proven.[Dahak2006] 1.6.6.3 Elimination of stop-words This phase is of some importance since it constitutes a factor of great influence in the accuracy of the search. The failure to remove stop words inevitably cause noise. The elimination of stop words which are words of everyday language and do not contain much semantic information must be both in indexing as querying (removing stop-words from the query). [Dahak2006] 1.6.6.4 Weighting This step is entirely dependent on the model of information retrieval used. It defines how important a term in a given document. [Dahak2006] In general, most term weighting formulas are built by combination of two factors. A local weighting factor measuring the local representativity of a term in the document, and an overall weighting factor measuring the global representativity of a term with respect to the collection of documents[Amrouche2008]. That leads to two types : 17
  • 42. Chapter 1 1.6.6.4.1 Local weighting Local weighting takes into account the local informa- tion of the term that depend only on the document. It is typically a function of frequency of occurrence of the word in the document, denoted tf (Term Frequency). A term that frequently appears in a document is considered relevant to describe its contents. [Dahak2006] 1.6.6.4.2 Overall weighting The overall weight measures the importance of a term within all documents. It aims to represent its discriminatory nature, or in other words its ability to distinguish between document. In fact, a term appearing in few documents is considered more discriminatory and should be favored over a term found in many documents. The calculation of the overall weighting is based on the number of documents in which a term appears. One of the most used is idf (Inverse Document Frequency), represented by the following formula: N Idf = log( ni ) Such as ni is the number of documents containing the word i and N is the total number of documents. The value tf *idf gives a good approximation of the importance of a term in the document, particularly in the corpus of documents of similar size. [Dahak2006] 1.7 Querying Querying is the phase of interaction between the system and the user. This expresses the need for information via a query language that the system will take care of interpreting. This interpretation is done according to the query template and is designed to understand user needs and express them in a formalism similar to the one used when indexing documents. This process provides an inner query. Following this phase of query interpreting, a matching pattern calculates the match between the inner  query and each document in the index. This calculation established by the mapping function, has traditionally resulted in an ordered list of documents. It should, at this level, a semantic comparison (not equal) between concepts in of document and those of the query. The comparison between query and document rarely leads to strict equivalences, but rather to partial equivalences: the document is only part of the query. The first document in the list returned by the system is one that is considered by the system as the most relevant, that is to say the one that best suits the query, again according to the system. The final document is one that is considered by the system as the 18
  • 43. Chapter 1 least relevant. This notion of relevance is based on the proximity between the needs expressed by the user and the results provided by the system.[Dahak2006] 1.7.1 Relevance concept Relevance is a central concept of the query because all evaluations are based around this concept. But it is also the most poorly understood concept, despite numerous studies on this concept as the one in[Denos1997]. Let us see some definitions of the relevance. Relevance is: • The correspondence between a document and a query, a measure of informativeness of the document to the query; • A degree of relationship (overlap, relativity, ...) between the document and the query; • A degree of surprise that comes with a document that is relevant to the needs of the user; • A measure of usefulness of the document to the user. Even in these definitions, the used concepts (informativeness, relativity, surprise ...) remains very vague because users have very different needs. They have very different criteria for judging whether a document is relevant. So the notion of relevance is used to cover a very wide range of criteria and relations[Dahak2006]. 1.7.2 Similarity Function Comparing between the document and query is equivalent to calculating a score, assumed to represent the relevance of the document in respect to the query. This value is calculated from a function or a probability of similarity denoted rsv(q,d) (retrieval status value), such as q is a query and d est a document and whose formula depends entirely on the used model of information retrieval. This measure takes into account the weight of terms in documents determined by statistical analysis and probability. The matching function is very closely related to the operations of indexing and weighting of query terms and documents in the corpus. In general, the matching document - query and indexing model used to characterize and identify a model of information retrieval. The similarity function is then used to order the documents returned to the user. The quality of this ordering is paramount. In fact, users is generally satisfied to examine the first documents (the top 10 or 20). If the documents sought are not present in this slice, the user will consider sorting as bad in respect to his query[Sauvagnat2005, Dahak2006]. 19
  • 44. Chapter 1 1.7.3 Search process Search takes a user query and returns the effective list of matching results sorted by relevance. Such as indexing, searching is a multiphase process, as shown in Figure. Figure 1.3: Search process The first operation is about building the query. Depending on the full text search, the way to express query is either: : 1. String based—A text-based query language. Depending on the focus, such a language can be as simple as handling words and as complex as having Boolean operators, approximation operators, field restriction, and much more! 2. Programmatic API based—For advanced and tightly controlled queries a programmatic API is very neat. It gives the developer a flexible way to express complex queries and decide how to expose the query flexibility to users. Some tools will focus on the string-based query, some on the programmatic API, and some on both. The second operation, let’s call it analyzing, is responsible for taking sentences or lists of words and applying the similar operation performed at indexing time (chunk 20
  • 45. Chapter 1 into words, stems, or phonetic description). This is critical because the result of this operation is the common language that indexing and searching use to talk to each other and happens to be the one stored in the index. If the same set of operations is not applied, the search won’t find the indexed words—not so useful! Based on the common language between indexing and searching, the third operation (finding documents) will read the index and retrieve the index information associated with each matching word. Remember, for each word, the index could store the list of matching documents, the frequency, the word positions in a document, and so on. The implicit deal here is that the document itself is not loaded, and that’s one of the reasons why full-text search is efficient: The document does not have to be loaded to know whether it matches or not. The next operation (filtering and ordering) will process the information retrieved from the index and build the list of documents (or more precisely, handlers to docu- ments). From the information available (matching documents per word, word fre- quency, and word position), the search engine is able to exclude documents from the matching list. More important, it is able to compute a score for each document. The higher its score, the higher a document will be in the result list. let’s have a look at some factors influencing its value : • In a query involving multiple words, the closer they are in a document, the higher the rank. • In a query involving multiple words, the more are found in a single document, the higher the rank. • The higher the frequency of a matching word in a document, the higher the rank. • The less approximate a word, the higher the rank. Depending on how the query is expressed and how the product computes score, these rules may or may not apply. This list is here to give you a feeling of what may affect the score, therefore the relevance of a document. Once the ordered list of documents is ready, the full-text search engine exposes the results to the user. It can be through a programmatic API or through a web page. the following figure shows a result page from the Google search engine.[Bernard2009] 21
  • 46. Chapter 1 Figure 1.4: Search results returned as a web page: one of the possible ways to expose results 1.8 Semantic Approach Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results. Semantic search systems consider various points including context of search, location, intent, variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries to provide relevant search results[Web-Techulator]. Major web search engines like Google and Bing incorporate some elements of semantic search. Rather than using ranking algorithms such as Google’s PageRank to predict relevancy, semantic search uses semantics, or the science of meaning in language, to produce highly relevant search results. In most cases, the goal is to deliver the information queried by a user rather than have a user sort through a list of loosely related keyword results. However, Google itself has subsequently also announced its own Semantic Search project[Web-WSJ]. Other authors primarily regard semantic search as a set of techniques for retrieving knowledge from richly structured data sources like ontologies. Such technologies enable the formal articulation of domain knowledge at a high level of expressiveness and could enable the user to specify his intent in more detail at query time[Web-ESWC2012]. Semantic search does not just mean contextual search or search based on the intend of the question. It include several other factors as well. A smart search engine would consider several factors to provide the most relevant and useful search queries, 22
  • 47. Chapter 1 including[Web-Techulator]: • Current trend: If the president election was just finished in the country and someone is searching for ’Who is the new president’, the semantic search system should be able to understand the query and give relevant results based on the current trend and news. • Location of search: If a person is searching for ’what is the temperature’, the semantic search engine should be able to provide results based on the current location of the search. If the person is searching from California, search results should include the current temperature in California. • Intend of the search: Semantic search engines should be able to give appropriate search results based on the intent of the search and not based on the specific words used in the search query. • Variations of words in Semantic Search: Semantic search should consider tenses, plural, singular etc and provide relevant search results for all semantic variations of the words. For example, words like dog, dogs, dog’s etc. • Synonyms and Semantic Search: A semantic search engine should be able to understand the synonyms and give more or less the same search results on any synonyms of the word users search for. For example, try searching for ”biggest mountain” and ”highest mountain”. You would get pretty much the same results since both of them means the same in this particular query, even though the ”biggest” and ”highest” could mean different things in different cases. • Generalized and Specialized queries: Semantic Searching engine should be able to set relation between generalized and specialized queries and provide appropriate and relevant results. For example, consider an article on general health topics and another article specifically on Diabetes. If someone search for health information, both articles could match even though the article on Diabetes does not talk specifically about ”health”. • Concept matching: This is a sub-set of context matching in semantic search. Semantic search should understand the broad concept of the query and return relevant results. For example, a query on ”Traffic problems in New Jersey” could return relevant results including the topics ”narrow roads”, ”non functioning traffic lights”, ”lack of roadside assistance” etc because in a broad conceptual point of view, all of these lead to traffic problems. 23
  • 48. Chapter 1 • Natural language queries: Not everyone are tech savvy and not many people know what to search to get the relevant search results. Most users simply type in queries in natural language. For example, if some one want to find what is the current time in Arizona, USA, they would search for ’What time is it in Arizona’. Most search engines would simply show results from the websites and articles that talk about Time and Arizona. However, smart search engines that use Semantic Search would actually show you the current time in Arizona, USA. Try it yourself at Google search. • Change of meaning based on the group of words: By combining different words, the true meaning of search term could change. Consider the following search terms: ◦ New egg health products ◦ New egg health benefits If you search for both the above terms in Google, you would get completely different meaning. Instead of just picking the results based on the words, Google Search looks at it as a term and then combines with common user search pattern. The first term returns search results primarily on the popular online shopping website NewEgg.com and shows results of health products from that site and similar sites. The second term shows search results for the health benefits of Egg. Semantic Search is a big challenge for search engines and none of them are perfect. Most search engines have improved significantly in last few years. Search engines like Bing and Google provide significantly relevant search results incorporating some degree of semantic search. There are many other specialized search engines (like Hakia) which offer purely semantic search results, but they lack many other qualities of normal search engines. 1.9 Conclusion In this chapter, the study focused on the working mechanism of search engines and information retrieval systems, based on indexing due to its importance. Indeed, it is the most important step in the search process as it allows the extraction and processing of keywords. The search phase does not offer only the interaction between users and the system, but also calculates the match percentage between the query and the documents to provide the most relevant results. 24
  • 49. Chapter 2 Arabic Language ‫أَنا البحر في أَحشائ ِِه الدر كامـــن *** فهل سأَلوا الغواص عن صدَفاتـــي‬ َ َ َ َّ ٌ ِ ُّ ُ ُ َ َ ََ ‫حافظ ابراهيم، قصيدة عن اللغة العربية‬ 2.1 Introduction Arabic ( ‫ ) العربية‬is a name applied to the descendants of the classical Arabic language of the sixth century AD, the most widely used in the Qur’an, the Islamic holy book. Arabic is a Central Semitic language, closely related to modern Hebrew and Aramai languages1 . In this chapter, we will talk about orthography and morphology of the Arabic language that are unique. we will also talk about some ambiguities that may appear in Arabic due to the absence of vocalization. 2.2 Orthography The Arabic script is one of the most used scripts all over the world. It dominates in the Arab countries, of course, but holds a special place for all Muslims because it is the script used to write the Qur’an.[Jabri] It is written from right to left like other Semitic languages. Its alphabet has twentynine2 consonant letters, three of them . ‫يوا‬ are considered as vowels. Optionally, 1 Modern Aramaic, languages are varieties of Aramaic that are spoken vernaculars in the medieval to modern era, evolving out of Middle Aramaic dialects around AD 1200. 2 The scientists of Arabic language considered the Hamzah as a letter 25
  • 50. Chapter 2 one of the three Diacritical marks . ‫ـِ ُـ َـ‬ can be placed after certain characters to resolve the ambiguity in pronunciation and/or direction when it arises. In a fully vocalized Arabic text, the lack of diacritics can be regarded as a sokōune . ‫ْـ‬ (silence). In some cases, a letter doubled, may be replaced by a single letter with tashdeed . (reinforcement) placed above. [AlKharashi1999] ّ‫ـ‬ In addition, it is important to note that the notion of uppercase and lowercase letter does not exist, the Arabic writing is called unicameral. In addition, Arabic is a language semi cursive, most letters are attached to each other, their spellings differ depending on whether they are preceded and/or followed by other letters or they are isolated. Only six of them does not attach to the following letter : . one letter does not attach at all : . ‫ء‬ ‫وزرذدا‬ and [Mesfar2008] Letter Hamzah Wâw Ayn Spellings ‫ء‬ ‫و ـو‬ ‫ـع ـعـ عـ ع‬ Table 2.1: 3 types of Arabic letters: 1 form, 2 forms or 4 forms 2.3 Lexicography The traditional Arabic grammar has only three subsets: Nouns, Verbs and Particles. 2.3.1 Verbs A verb is an entity expressing a time-dependent sense. Most Arabic verbs are formed ‫ك َتب‬ َ َ and eventually four consonants that is the case of the verb . ‫دحرج‬ َ َْ َ on three radical consonants that is the case of the verb . (kataba – write) (dahrağa – roll along). These roots may form several patterns as a result of one or more morphological transformations (eg: repetition of a consonant, lengthening a vowel, the  expanding of a morpheme, etc.), it comes in this case to roots with augmented pattern. Several linguistic studies have been conducted on the verbal system in Arabic, see[Larcher2003]. In this section, it is necessary to introduce a classification of verbs according to their radicals: 2.3.1.1 Verbs with a simple root (‫المجرد‬ ّ ‫:)الفعل‬ A verb with a simple root has a base of three consonants called radical consonants. These verbs are associated with verbal pattern . ‫فعل‬ َ ََ (fa’ala). When none of the root 26
  • 51. Chapter 2 consonants of the verb is a long vowel, it is called healthy. These radicals may involve ِ processing or causes of defects (‫ ,)علَّة‬we mention : • The presence of . ٔ‫ا‬ (á – hamzä), . ‫ي‬ (y - yâ’) or . ‫و‬ (w – wâw) among the radical consonants. Depending on the position of that, we distinguish different types of verbs : ◦ If one of the root consonants is . = Hamzated verb (‫مهموز‬ َْ ٔ‫ا‬ (á – hamzä), independently of its position ‫; )فعل‬ ‫و‬ ◦ The first radical consonant is a . (‫مثال‬ َِ ‫;)فعل‬ ◦ The second radical consonant is a . ‫;)أَج َوف‬ ْ ◦ The third radical consonant is a . ِ ‫;)نَاقص‬ ‫و‬ (w) or . ‫ي‬ (y) = Assimilated verb (w) or . ‫ي‬ (y) = Hollow verb (‫فعل‬ ‫و‬ (w) or . ‫ي‬ (y) = Weakened verb (‫فعل‬ • The presence of two identical consonants in the second and third position of the root = Geminated verb (‫مضاعف‬ َ َ ُ ‫.)فعل‬ 2.3.1.2 Verbs with augmented root (‫المزيد‬ ‫:)الفعل‬ The patterns of verbs with augmented root are formed from simple roots by a set of morphological operations to provide a specific meaning to the outcome verbs , we mention: • . • . • . • . • . • . • . • . ‫فعل‬ َّ َ ‫فاعل‬ ََ َ ‫أَفعل‬ َ َْ (fa’ala) (faā’ala) (áfa’ala) ‫( َتفعل‬tafa”ala) َ َّ َ ‫( َتفَاعل‬tafaā’ala) ََ ‫( ا ِْف َتعل‬ìfta’ala) َ َ ‫( اِنْفعل‬ìnfa’ala) َ ََ ‫( اِس َتفعل‬ìstaf’ala) َ َْ ْ 27
  • 52. Chapter 2 2.3.2 Nouns The morphological system of Arabic nouns contains three subcategories: 2.3.2.1 Primitive nouns (‫الجامدة‬ ‫:)الأسماء‬ The primitive nouns are nouns that can not be attached to a verbal root. They well form the fundamental glossary of the concrete language. eg: . . ‫كُ ْرسي‬ ّ ِ (kursiyy – chair), . َ ‫ك ْبش‬ ‫أخ‬ (raás – head), (kabš – Sheep), etc. In this category we also include nouns composed of two letters such as: . (áab – father), . ‫رأْس‬ َ ‫دم‬ (dam - blood), . ‫فم‬ (fam - mouth), . ‫أب‬ (áh – brother), etc. ّ 2.3.2.2 Nouns derived from verbals (‫المشتقة‬ ‫: )الأسماء‬ These are the nouns that can be derived from a verbal root. The number and nature of these forms vary depending on the status of the verb to which they relate. As nouns, they can receive marks of case, gender and indeterminacy. 2.3.2.3 Numbers : ‫صفر‬ ‫عشرون‬ This category of nouns is made up of simple numerals representing units: from . (sifr- zero, 0) to . ‫تسعة‬ (tisat – nine, 9); the tens: . (išruwn – twenty, 20) … and . ‫تسعون‬ ‫تسعة_عشَ ر‬ َ (ašarat – ten, 10), . (tisuwn – ninety, 90) ; the hundreds, etc. well as numerals compounds such as cardinals of . to . ‫عشرة‬ (tisat ašara - nineteen, 19). ‫أَحد_عشَ ر‬ َ َ (áahada ašara - eleven, 11) In their decomposition, the Arab grammarians have classified adjectives to nouns as they almost take all the morphological forms and may, for example, be definite or indefinite and flex according to case, number and type. 2.3.2.4 Demonstrative pronouns (‫الإ شارة‬ ‫:)أسماء‬ Demonstrative pronouns represent a subcategory of noun expressing an idea of demonstration. They can indicate that the object represented is found, either in the text, either in space or time, defined by the situation of utterance. They are two subsets: near-deictic (eg: . . َ‫ذلِك‬ ‫هذا‬ َ (dalika – that), . (hadaā – this), . َ‫أُولائِك‬ ‫هؤلاء‬ َُ (haẃulaā’ – these) and far-deictic (eg: (ùuwlaāýika – those), etc.). Demonstratives are deriv- able only to dual. 28
  • 53. Chapter 2 2.3.2.5 Relative pronouns (‫موصولة‬ ‫:) ٔاسماء‬ Relative pronouns relate to the noun or personal pronoun that precedes them and that we denote by antecedent. The relatives shall afford with their antecedents but are derivable only to dual (as demonstratives). Among the relative pronouns, we mention: . ‫الَّذي‬ dual), . (al-ladiy - that, masculine, singular), . ‫اللذين‬ َ (those, masculine, plural), etc. 2.3.2.6 Personal pronouns ( ِ‫الل َت ْين‬ (al-latayni – those, feminine, ‫:) الضمائر المنفصلة‬ Personal pronouns are intended to identify three types of grammatical persons: • First person, ie, the speaker (‫ , )المتكلم‬that who is talking: . . ‫نَحن‬ ُ ْ ‫أَنَا‬ (ánaā - I) or (nahnu – we) ; ‫( أَنْت‬áanta – َ ِ ْ‫( أَن‬áanti – you, feminine, singular), . ‫( أَنْ ُتما‬áanyou, masculine, singular), . ‫ت‬ َ tumaā – you, dual), . ‫( أَنْتم‬áantum – you, masculine, plural), . ‫( أَنْتن‬áantunna ُْ َّ ُ • Second person, ie, the listener (‫ ,)المخاطب‬that who talking to: . – you, feminine, plural); • Third person, ie, the absent (‫ , )الغائب‬that who talking about: . . . 2.3.3 ِ ‫هي‬ َ ‫هن‬ ُ (hiya – she), . ‫هما‬ َُ (humaā – they, dual), . (hunna – they, feminine). ‫هم‬ ُْ ‫ه َو‬ ُ (huwa – he), (hum – they, masculine), Function words The function words are used to locate entities, facts or objects in relation to time or place. They also play a key role in the coherence and sequencing of a text. For example, we have particles that designate a time: • . ‫بعد‬ (bada – after) • . ‫قبل‬ (qabla – before) • . ‫منذ‬ (mundu – since) or a place like • . ‫حيث‬ (haytu – where), According to their semantic meaning and their function in the sentence, they can play an important role in the interpretation of a sentence expressing an introduction, explanation, consequence, etc.[Kadri1992]. Function words include various categories, we mention: 29
  • 54. Chapter 2 • Prepositions : . ِ ‫في‬ (fiy – in) or . • Conjunctions: . ‫ثُم‬ َّ (tumma – then) ; • Adverbs : . ‫أَ َبدًا‬ ‫ع َلى‬ َ (abadã – never) or . in the normal way) ; • Quantifiers: . ‫كل‬ ُّ (kulla – all ) or . (alaą – on); ِ َ ْ ‫بِشكل_عادي‬ ‫َبعض‬ ْ (bišaklĩ aādiyyĩ – normally, (bada – some) ; • Etc. The function words are divided into subgroups: those variables (quantifiers) and those that are invariable (adverbs, prepositions, etc.). 2.4 Morphology There are several categories of Fusional languages, and Arabic is precisely in the category of languages with Intro-flexion: this category of languages, the consonants indicate the meaning and vowels mark the flexion of  word. This system is found especially in the Semitic languages (eg: Arabic, Hebrew) [Choueiter2006] Morphologically, the Arabic language is very rich and based on the structure of patterns and roots. Most Arabic words are generated from a finite set of roots (about 7000 roots) transformed using one or more patterns (about 400-500). Theoretically, a single Arabic root can generate hundreds of words (noun, verb, ...). An Arabic word can exist in about a hundred of forms in a normal text by adding certain suffixes and prefixes (mainly considered as stop-words in English).[AlKharashi1999] 2.4.1 Flexional Morphology Arabic uses for the declension of verbs and nouns, some indications of aspect, mood, time, person, gender, number and case, which are generally suffixes and prefixes[Gaudefroy1975]. Generally, these flexional marks can distinguish [?] : • Mode of verbs: eg, for the verb . ‫ذهب‬ َ َ (dahaba – to go), forms in the Perfective ‫( ذه ْبت‬dahabatu – I went) or ُ َ their prefixes such in the Imperfective (‫ )المضارع‬as . ‫( أَذهب‬áadhabu – I go) ; ُ َ ِ ُ َ Function of nouns: using of suffixes such as . ‫( رجلان‬rağulaáni – Two men) in Nominative (‫ )حالة الرفع‬or . ِ‫( رجلين‬rağulayni – Two men) in Accusative (‫حالة‬ َْ ُ َ ‫ )النصب‬or Genitive (‫]?[ .)حالة الجر‬ (‫ )الماضي‬can be identified using their suffixes as . • 30
  • 55. Chapter 2 2.4.1.1 Flexion of verbs Called also Conjugation, it describes the variation in their forms according to circumstances. Generally, conjugation includes a number of values which are: • Aspect: The aspect is a grammar feature associated, in most cases, to verbs in order to indicate which state it expresses; considered from the perspective of its development (beginning, progress, completion, overall evolution, etc.), regardless of when it comes ; • Mood : Mood indicates how the action expressed by the verb is designed and presented. The action can be doubted, affirmed as actual or eventual. They combine the semantics of verbs and thereby create aspects ; • Tense : Tense is a grammatical feature to locate a fact (which may be a state or action) in the enunciation time axis relative to the three markers: past, present and future. The temporal indications are often accompanied by aspectual indications that are more or less related. These three key values are closely related ; they can describe two basic forms of the verb in Arabic : • Perfective (‫ :)الماضي‬it indicates that the progress of the action expressed by the verb is finished, which means the past. It is characterized by adding suffixes of person, gender, number and mood to the verb’s stem. For example, for the feminine plural of the verb . get the form . ‫ك َت ْبن‬ َ َ ‫ك َتب‬ َ َ (kataba – to write), we add the suffix . ‫ن‬ َ to (katabna - *they* wrote, feminine) and for the masculine plural, we add the suffix . ‫وا‬ masculine) ; to get the form . َ ‫ك َت ُبوا‬ (katabuwá - *they* wrote, • Imperfective (‫ :)المضارع‬it indicates an unfinished progress, which may imply the present. It is characterized by adding a prefix and one or more infixes as a letter duplication or a vowel substitution. For example, for the verb . – to give), we can get . ُّ ُ ‫أَمد‬ (áamuddu – I give) or . ‫َيمددن‬ ُْ ْ َّ َ ‫مد‬ (madda (yamdudna – they gives, feminine). It includes two types of modal inflections: ◦ The indicative of actual mode where the speaker states the actual character (reread, to be achieved, in progress, etc.) of action or state expressed by the verb; ◦ The subjunctive of potential mode in which the speaker merely states the possible or virtual nature of action or state expressed by the verb. 31
  • 56. Chapter 2 • Imperative (‫ :)الأمر‬it expresses the order, command, or exhortation ... etc. It exists only the with the 2nd person in singular, dual and plural; 2.4.1.2 Flexion of nouns In Arabic, the declension (‫ )إعراب‬of nouns involves three cases: Nominative (‫,)مرفُوع‬ َْ Accusative (‫ )منصوب‬and Genitive (‫ .)مجرور‬Except for some special cases, the nouns are ْ ُ ْ declinables (‫ )معربة‬and appear in one of these three cases according to their functions in the sentence. In terms of the spelling, the case represents only an assistant graphic at the end of nominal forms. The nominal system of Arabic admits different systems depending on the nature of variation of the form (triptote, diptote, etc.) and the number thereof (singular, dual or plural). We can distinguish : 2.4.1.2.1 Declension of singular nouns: Basic declension of triptotes (‫ :)منصرف‬This is the most frequent case, it takes the vowel . ‫ضمة‬ َّ َ (dammat – u) as a sign of the nominative , the vowel . a) in the accusative and the vowel . ‫كَس َرة‬ ْ . ‫ة‬ ٍ‫ـ‬ (ta) or by . (fathat – (kasrat – i) in the genitive. When the noun is undefined, the tanwîn is marked respectively by the three diacritics: . (ã - an) et . ‫ف ْتحة‬ َ َ ٌ‫ـ‬ (ũ – un), . ‫ًـ‬ (ĩ – in). In the indefinite accusative , except the case of nouns ending by ‫ا‬ (ā’), an álif . ‫ا‬ (ā) strengthens the tanwîn . the accusative indefinite the noun . ِ ‫ك َتاب‬ (an) : for example, in (kitaāb – book) becomes . book, accusative, indefinite) and the book . (ğaziyratã – island, accusative, indefinite). Declension of diptotes(‫الصرف‬ ‫ًـ‬ ‫جزِيرة‬ َ َ ِ ‫ك َتا ًبا‬ (kitaābãā – (ğaziyrat – island) becomes . ‫:)الممنوع من‬ ‫جزِيرة‬ ًَ َ The nouns that are diptotes, gram- matically undefined, do not accept tanwîn and take the same mark in the accusative and the genitive which is the . ‫ف ْتحة‬ َ َ (fathat – a). By contrary, when they are defined, they follow the declension of triptotes. This is the case of feminine nouns that end ‫( اء‬ā’) such as . ‫( صح َراء‬sahraā’ – desert), masculine adjectives of colors with ْ ْ the pattern . ‫( أَفعل‬afal) such as . ‫( أحمر‬áhmar – red) and those which are feminine َ ْ with the pattern . ‫( فعلَاء‬falaā’) such as . ‫( بيضاء‬baydaā’ – white , feminine) َ َْ َْ with Declension of The Five Nouns (‫الخمسة‬ • Three nouns: . ‫ٔابو‬ (áabuw - father), . ‫ٔاخو‬ ‫:) الأسماء‬ The five nouns are : (áhuw - brother) and . ‫حمو‬ (hamuw - stepfather) ; • A variant of . َ ‫فم‬ (fam – mouth) : . َ ‫فا‬ ,. ‫فو‬ and . ِ ‫في‬ ; 32