Proposal of an Advanced Retrieval System for Noble Qur’an

Advanced Search/Indexing in Holy-Quran
Assem Chelli
2011 / 2012

Ministry of Higher Education and Scientiﬁc Research
National Higher School of Computer Science

Thesis Of Magister
Option : Mobile Destributed Computing (IRM)
Proposal of an Advanced Retrieval System for Noble Qur’an

Written By :

Supervised By :

• Assem CHELLI

• Pr. Amar BALLA
• Mr. Taha ZERROUKI

2011/ 2012

... and say : O my Lord ! have compassion on them, as they brought me up (when I was)
little. – Al-isra’ 24

iii

Acknowledgment

First at all, I am thanking Allah, the Almighty for giving me strength and patience to
write this modest thesis.
We gratefully acknowledge Pr. Amar Balla and Mr. Taha Zerrouki for giving me the
honor of their supervising during that year and guiding me with advices, and meaningful
criticism.
I also thank the jury members for agreeing to evaluate our modest work.
A big thanks to the faculty and the administration of the National Higher School of
Computer Science (ESI) who took care of my training and monitoring throughout the
study program.

Our deepest thanks go to the Arab open source community that has oﬀered me a great
support, especially Alfanous Team/Community that makes a valuable contribution to
carry out this great work, and I hope to be worthy of the conﬁdence they have placed on
me.
Finally I express my appreciation to all who contributed by their advice and their
encouragement to the completion of this work, my family and my friends for their
assistance and support.

iv

Abstract
Noble Quran is different of all documents that we have known. It’s the sacred book
of Muslims. It contains knowledge of all aspects of life. With this huge quantity of
information, we can extract only a small part manually and this is considered insufficient compared to the size of knowledge contained by Quran. That raises the need for
a method to extract those information because currently there is no efficient method
except many printed lexicons and many tools of simple sequential search with regular
expression. Due to this limitation, the Quran requires us to find new ways to interact.
The goal through this work is to propose a system for advanced research in all of
the information contained in the Quran by considering the morphology of the Arabic
language and the properties of the Qur’anic text. It should be based on modern methods of information retrieval for good stability and high speed search. It would be very
useful for researchers and could be generalized to cover all the content in Arabic.

Keywords : Indexing/Search, Arabic, Holy Quran, Information retrieval, Search
engines.

v

Résumé
Le Coran est différent de tous les documents que nous connaissons . C’est le livre
sacré des musulmans. Il comporte des connaissances sur tous les aspects de la vie. Avec
un tel volume d’informations, on ne peut y extraire qu’une infime partie manuellement.
Ceci s’avère être insuffisant vue la quantité de connaissances que contient le Coran. D’où
la nécessité de trouver une méthode pour extraire ces informations. Or il n’existe aucun
outil à utiliser sauf quelques lexiques imprimés et quelques outils de recherche simple
et séquentielle par les expressions régulières. En raison de cette limitation, le Coran
nous oblige à trouver de nouvelles façons d’interaction.
Le but recherché à travers ce travail est de proposer un système avancé de recherche
dans l’ensemble des informations contenues dans le Coran en prenant en considération
la morphologie de la langue Arabe et les propriétés du texte coranique. Elle doit être
fondée sur les méthodes modernes de recherche d’informations pour obtenir une bonne
stabilité et une recherche de grande vitesse. Elle serait trés utile pour les chercheurs et
pourrait être généralisée pour couvrir l’ensemble du contenu en arabe.

Mots clés : Indexation/Recherche, Arabe, Coran, Recherche d’information, Moteurs de recherche.

vi

‫ملخّص‬
‫القرآن الكريم يختلف عن جميع الوثائق التي نعرفها فهو يحوي المعارف في جميع جوانب الحياة. مع‬
‫هذا الحجم من المعلومات، لا يستطيع المرء استخراج إلا النزر اليسير يدويا وهذا ليس كافيا بالنسبة‬
‫لحجم المعارف الواردة في القرآن الكريم. ومن هنا جاءت الحاجة إلى إيجاد طريقة لاستخراج هذه‬
‫المعلومات. لا توجد أي وسيلة حالية فعالة باستثناء بعض المعاجم المطبوعة وبعض الأدوات التي‬
‫تعتمد البحث البسيط التسلسلي بالعبارات النمطية. وبسبب هذا القيد، يتوجب علينا إيجاد طرق‬
‫جديدة للتفاعل.‬
‫الهدف من خلال هذا العمل هو اقتراح نظام متقدم للبحث في جميع المعلومات الواردة في‬
‫القرآن الكريم، مع الأخذ بعين الاعتبار مورفولوجيا اللغة العربية وخصائص النص القرآني. وينبغي‬
‫الاستناد إلى الأساليب الحديثة في استرجاع المعلومات من أجل تحصيل استقرار جيد وسرعة عالية‬
‫في البحث. هذا العمل مفيد للباحثين ودارسي القرآن ويمكن أن يعمم ليشمل جميع المحتوى باللغة‬
‫َّ‬
‫العربية.‬
‫كلمات مفتاحية: بحث/فهرسة، العربية، القرآن، استخراج المعلومات، محركات البحث.‬

‫‪vii‬‬

Contents

Dedication

iii

Acknowledgment

iv

Table of Contents

xv

List of Figures

xvii

List of Tables

xix

List of Abbreviations

xx

Glossary

xxiv

General Introduction

1

I State Art

4

1 Search engines

5

1.1

Introduction

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5

1.2

Deﬁnitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

1.2.1

Keyword

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

1.2.2

Descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

1.2.3

Document

6

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

viii

1.2.4

Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

1.2.5

Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

1.3

Search engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7

1.4

Full-text search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7

1.5

Crawling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

8

1.5.1

Crawler Features . . . . . . . . . . . . . . . . . . . . . . . . . .

9

1.5.1.1

Features a crawler must provide . . . . . . . . . . . .

9

1.5.1.2

Features a crawler should provide . . . . . . . . . . .

9

Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

10

1.6.1

Deﬁnition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

10

1.6.2

Indexing modes . . . . . . . . . . . . . . . . . . . . . . . . . . .

11

1.6.2.1

Manual indexing . . . . . . . . . . . . . . . . . . . . .

11

1.6.2.2

Automatic indexing . . . . . . . . . . . . . . . . . . .

12

1.6.2.3

Semi-automatic indexing . . . . . . . . . . . . . . . .

13

Index types . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

13

1.6.3.1

Document Index

. . . . . . . . . . . . . . . . . . . .

13

1.6.3.2

Forward Index . . . . . . . . . . . . . . . . . . . . . .

14

1.6.3.3

Inverted index . . . . . . . . . . . . . . . . . . . . . .

14

1.6.3.4

N-gram index . . . . . . . . . . . . . . . . . . . . . . .

15

1.6.4

Index storage . . . . . . . . . . . . . . . . . . . . . . . . . . . .

15

1.6.5

Index update . . . . . . . . . . . . . . . . . . . . . . . . . . . .

16

1.6.5.1

Incremental update . . . . . . . . . . . . . . . . . . .

16

1.6.5.2

Global update . . . . . . . . . . . . . . . . . . . . . .

16

Indexing phases . . . . . . . . . . . . . . . . . . . . . . . . . . .

16

1.6.6.1

Tokenization . . . . . . . . . . . . . . . . . . . . . . .

16

1.6.6.2

Normalization . . . . . . . . . . . . . . . . . . . . . .

17

1.6.6.3

Elimination of stop-words . . . . . . . . . . . . . . . .

17

1.6.6.4

Weighting . . . . . . . . . . . . . . . . . . . . . . . . .

17

Querying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

18

1.7.1

Relevance concept . . . . . . . . . . . . . . . . . . . . . . . . .

19

1.7.2

Similarity Function . . . . . . . . . . . . . . . . . . . . . . . . .

19

1.7.3

Search process . . . . . . . . . . . . . . . . . . . . . . . . . . .

20

1.8

Semantic Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

22

1.9

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

1.6

1.6.3

1.6.6

1.7

2 Arabic Language

25

2.1

Introduction

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

2.2

Orthography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25
ix

2.3

Lexicography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

26

2.3.1

26

Verbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3.1.1

‫. :)الفعل‬
Verbs with augmented root (‫:)الفعل المزيد‬

. . . . . . .

26

. . . . . . .

27

Nouns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

28

2.3.1.2
2.3.2

Verbs with a simple root (‫المجرد‬
ّ

2.3.2.1

‫. . . . . . :)الأسماء‬
ّ
Nouns derived from verbals (‫: )الأسماء المشتقة‬
Primitive nouns (‫الجامدة‬

. . . . .

28

. . . .

28

2.3.2.3

Numbers : . . . . . . . . . . . . . . . . . . . . . . . .

28

2.3.2.4

Demonstrative pronouns (‫الإ شارة‬

. . . . . . . .

28

. . . . . . . . .

29

. . . . . . . .

29

Function words . . . . . . . . . . . . . . . . . . . . . . . . . . .

29

Morphology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

30

2.4.1

Flexional Morphology . . . . . . . . . . . . . . . . . . . . . . .

30

2.4.1.1

Flexion of verbs . . . . . . . . . . . . . . . . . . . . .

31

2.4.1.2

Flexion of nouns . . . . . . . . . . . . . . . . . . . . .

32

2.4.1.3

Flexion of function words . . . . . . . . . . . . . . . .

34

Derivational morphology . . . . . . . . . . . . . . . . . . . . . .

34

2.3.2.2

2.3.2.5
2.3.2.6
2.3.3
2.4

2.4.2

2.4.2.1
2.4.2.2

‫:) ٔاسماء‬

‫. . :) أسماء‬
Personal pronouns ( ‫:) الضمائر المنفصلة‬
Relative pronouns (‫موصولة‬

Deverbal noun (‫:)المصدر‬
Active participle (‫فاعل‬

. . . . . . . . . . . . . . . .

‫. . . . . . . . . . :)اسم‬
Passive participle (‫. . . . . . . . . :)اسم مفعول‬
Nouns of time and place (‫:)أسماء الزمان والمكان‬
Noun of instrument (‫. . . . . . . . . :)اسم الآلة‬
The Nomen Vicis (‫. . . . . . . . . . :)اسم المرة‬
The Nomen Speciei (‫. . . . . . . . . :)اسم الهيئة‬

35

. . . .

35

. . . .

35

. . . .

35

. . . .

35

. . . .

36

. . . .

36

Ambiguity issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

36

2.5.1

The absence of vocalization . . . . . . . . . . . . . . . . . . . .

36

2.5.2

Prefixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

37

2.5.3

Suffixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

38

2.6

The computerization of Arabic language . . . . . . . . . . . . . . . . .

39

2.7

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

40

2.4.2.3
2.4.2.4
2.4.2.5
2.4.2.6
2.4.2.7
2.5

3 The Qur’an

41

3.1

Introduction

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

41

3.2

Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

41

3.3

Qur’an Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

42

3.3.1

Fragmentation into surahs . . . . . . . . . . . . . . . . . . . . .

43

3.3.2

Fragmentation into Hizbs . . . . . . . . . . . . . . . . . . . . .

43
x

3.3.3

44

Qur’anic Sciences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

44

3.4.1

Knowledge of Ayahs revelation places

. . . . . . . . . . . . . .

45

3.4.2

Knowledge of Ayahs revelation causes: . . . . . . . . . . . . . .

46

3.4.3

Knowledge of Morphology:

46

3.4.4

3.4

Fragmentation into Stops (Waqfs) . . . . . . . . . . . . . . . . .

‫. . . . . . : )علم مرسوم‬
Grammatical analysis of the Qur’an (‫: )إعراب ألفاظ القرآن‬
Science of allegorical ayahs (‫. . . . . . . . : )علم المتشابه‬
The beginnings of surahs (‫. . . . . . . . . . :)فواتح السور‬
ّ
Knowledge of Qur’anic Parables (‫. . . . . :)الأمثال القرآنية‬
Tafssīr (‫. . . . . . . . . . . . . . . . . . . . . . :)التفسير‬

. . . . . . . . . . . . . . . . . . . .

ّ
Knowledge of Orthography (‫الخط‬

. . . .

47

. . .

48

. . . .

48

. . . .

48

. . . .

49

. . . .

49

Computerization of the Qur’an . . . . . . . . . . . . . . . . . . . . . .

49

3.5.1

Advantages of Computerization . . . . . . . . . . . . . . . . . .

50

Qur’an Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

50

3.6.1

History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

51

3.6.1.1

Indexing words of the Qur’an . . . . . . . . . . . . . .

51

3.6.1.2

Ma’ājim of Qur’anic words . . . . . . . . . . . . . . .

51

3.6.1.3

Specialized Ma’ājim of Qur’anic words

. . . . . . . .

52

3.6.1.4

Qur’anic Indexes and Computer . . . . . . . . . . . .

52

Classiﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . . . .

52

3.6.2.1

By unit: . . . . . . . . . . . . . . . . . . . . . . . . . .

52

3.6.2.2

By purpose . . . . . . . . . . . . . . . . . . . . . . . .

56

Projects of building indexes . . . . . . . . . . . . . . . . . . . .

58

3.6.3.1

Midād lbayān

58

3.6.3.2

Indexes by Taha Zerrouki

3.6.3.3

Qur’anic Arabic Corpus

3.4.5
3.4.6
3.4.7
3.4.8
3.4.9
3.5
3.6

3.6.2

3.6.3

. . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .

60

Tanzil Project . . . . . . . . . . . . . . . . . . . . . .

62

3.6.3.5

Boundary-Annotated Qur’an Corpus . . . . . . . . . .

64

3.6.3.6

Qurany Concepts Tool . . . . . . . . . . . . . . . . . .

65

Qur’an Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

66

3.7.1

Qur’anic Concepts Ontology

. . . . . . . . . . . . . . . . . . .

67

3.7.2
3.8

59

3.6.3.4

3.7

. . . . . . . . . . . . .

The Ontology made by Hadj Henni: . . . . . . . . . . . . . .

68

Qur’anic Search Tools . . . . . . . . . . . . . . . . . . . . . . . . . . .

69

3.8.1

69

Alawfa (‫. . . . . . . . . . . . . . . . . . . . . . . . . . . )الأوفى‬

3.8.2

Al-Monaqeb-Alqurany (‫القرآني‬

‫)المنقب‬

. . . . . . . . . . . . . .

70

3.8.3

Quran complex search service . . . . . . . . . . . . . . . . . . .

71

3.8.4

Quranic Researcher (‫القرآني‬

72

‫)الباحث‬

. . . . . . . . . . . . . . . .

xi

3.8.5

Quranologie (‫القرآن‬

. . . . . . . . . . . . . . . . . . . . . .

73

3.8.6

Quranic Corpus Word-by-Word Search . . . . . . . . . . . . . .

74

3.8.7

Tanzil (‫. . . . . . . . . . . . . . . . . . . . . . . . . . . . )تنزيل‬

75

Zekr (‫. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . )ذكر‬

76

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

77

3.8.8
3.9

II

‫)علم‬

Analysis & Conception

78

4 Classification & Proposition of Qur’anic Search Features

79

4.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

79

4.2

Difficulties of Search in Quran

. . . . . . . . . . . . . . . . . . . . . .

79

4.3

Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

81

4.4

Proposals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

82

4.4.1

Advanced Query . . . . . . . . . . . . . . . . . . . . . . . . . .

83

4.4.2

Output Improvements . . . . . . . . . . . . . . . . . . . . . . .

84

4.4.3

Suggestion Systems . . . . . . . . . . . . . . . . . . . . . . . . .

85

4.4.4

Linguistic Aspects . . . . . . . . . . . . . . . . . . . . . . . . .

87

4.4.5

Qur’anic Options . . . . . . . . . . . . . . . . . . . . . . . . . .

91

4.4.6

Semantic Queries . . . . . . . . . . . . . . . . . . . . . . . . . .

93

4.4.7

Statistical System . . . . . . . . . . . . . . . . . . . . . . . . .

96

Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

96

4.5.1

4.5

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

97

4.5.1.1

Survey Participants Details . . . . . . . . . . . . . . .

97

4.5.1.2
4.6

Survey

Results of survey . . . . . . . . . . . . . . . . . . . . .

99

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5 Conception

102

5.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.2

Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2.1

Text Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.2.2

Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.2.3

Suggestions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.2.4

Results Processing . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.2.5

Indexes Importing . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.3

Full vocalized search engine . . . . . . . . . . . . . . . . . . . . . . . . 107

5.4

Othmani script and text processing . . . . . . . . . . . . . . . . . . . . 111
5.4.1

Substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.1.1

Romanizations . . . . . . . . . . . . . . . . . . . . . . 113
xii

5.4.1.2

Numbers into words:

. . . . . . . . . . . . . . . . . . 114

5.4.2
5.4.3

Normalization: . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.4.4

Filtering stop-words: . . . . . . . . . . . . . . . . . . . . . . . . 120

5.4.5
5.5

Tokenization: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

Lemmatization: . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

Qur’anic Word Search . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5.1
5.5.2

Semantically Related Words . . . . . . . . . . . . . . . . . . . . 125

5.5.3

Multi-level Derivations . . . . . . . . . . . . . . . . . . . . . . . 126

5.5.4

Speciﬁc Derivations . . . . . . . . . . . . . . . . . . . . . . . . . 127

5.5.5
5.6

Word properties search . . . . . . . . . . . . . . . . . . . . . . . 124

Fuzzy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

III Implementation
6 Implementation

130
131

6.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

6.2

Why Open Source? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.1

License : AGPL

. . . . . . . . . . . . . . . . . . . . . . . . . . 133

6.2.2

Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

6.2.3

Whoosh Search API . . . . . . . . . . . . . . . . . . . . . . . . 134

6.3

Previous Code Base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

6.4

Our improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.4.1

A New Centralized JSON Output System: . . . . . . . . . . . . 138

6.4.2

Many new features . . . . . . . . . . . . . . . . . . . . . . . . . 140

6.4.3

Resource Importing Manager . . . . . . . . . . . . . . . . . . . 142

6.4.4

Automating the API building . . . . . . . . . . . . . . . . . . . 143

6.4.5

A new console interface . . . . . . . . . . . . . . . . . . . . . . 143

6.4.6

Enhancing the web interface . . . . . . . . . . . . . . . . . . . . 143

6.4.7

Packaging system: . . . . . . . . . . . . . . . . . . . . . . . . . 144

6.4.8

Multiple search units . . . . . . . . . . . . . . . . . . . . . . . . 144

6.4.9

Coding Standardization . . . . . . . . . . . . . . . . . . . . . . 145

6.4.10 Documentation covering . . . . . . . . . . . . . . . . . . . . . . 146
6.4.11 Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.5

Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.5.1

Application Programming Interface . . . . . . . . . . . . . . . . 150
6.5.1.1

JSON web service . . . . . . . . . . . . . . . . . . . . 152

xiii

6.5.1.2
6.6

Console interface

. . . . . . . . . . . . . . . . . . . . 153

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

General Conclusion

156

Bibliography

158

Appendices

A1

Annex A: Paper Abstracts

A2

xiv

List of Figures

1.1

The various components of a web search engine . . . . . . . . . . . . .

9

1.2

Indexing Beneﬁts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

11

1.3

Search process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

20

1.4

Search results returned as a web page: one of the possible ways to expose
results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

22

2.1

Ligature of lâm and álif. . . . . . . . . . . . . . . . . . . . . . . . . . .

39

3.1

Explanatory diagram of fragmentation into Surahs . . . . . . . . . . .

43

3.2

Explanatory diagram of fragmentation into Hizbs . . . . . . . . . . . .

44

3.3

Index using the word as a unit [Arabic Quranic Corpus]

. . . . . . . .

53

3.4

Index that takes the ayah as a unit [Arabeyes Quran Model] . . . . . .

54

3.5

Classiﬁcation indexes by purpose . . . . . . . . . . . . . . . . . . . . .

56

3.6

The various structures of Qur’an . . . . . . . . . . . . . . . . . . . . .

57

3.7

Preview of Qurany . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

66

3.8

A closer look at Qur’an Concepts Ontology

68

3.9

Diagram of domain ontology of Qur’anic documents made by Hadj

. . . . . . . . . . . . . . .

Henni[Hadjhenni2008] . . . . . . . . . . . . . . . . . . . . . . . . . . .

69

3.10 Preview of Alawfa website: www.alawfa.com . . . . . . . . . . . . . . .

70

3.11 Preview of Al Monaqeb Alqurany . . . . . . . . . . . . . . . . . . . . .

71

3.12 Preview of Quran Complex Search page . . . . . . . . . . . . . . . . .

72

3.13 Preview of Quranic Researcher (www.quranicresearcher.com) . . . . . .

73

3.14 Preview of Quranologie (quranologie.com) . . . . . . . . . . . . . . . .

74

3.15 Preview of Qur’anic Arabic Corpus Word By Word search . . . . . . .

75

3.16 Preview of Tanzil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

76

3.17 Preview of Zekr Application . . . . . . . . . . . . . . . . . . . . . . . .

77

4.1

84

Pages view in Google.com . . . . . . . . . . . . . . . . . . . . . . . . .

xv

4.2

Highlight the keyword

‫( وحيدا‬alone) in Ayah 11 of al-modather – al-

fanous.org . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

84

4.3

Ayah in full diacritical marks - quran.com . . . . . . . . . . . . . . . .

85

4.4

Query spell correction in Google.com . . . . . . . . . . . . . . . . . . .

85

4.5

Focusing on ”Yaqub” in Ontology of concepts of corpus.quran.com

. .

86

4.6

Related searches suggestion in Google.com . . . . . . . . . . . . . . . .

87

4.7

Keyboard mapping Arabic to English in Google.com . . . . . . . . . .

87

4.8

Different romanizations for the word

. . . . . . . . . . . . . . . .

88

4.9

Different transliterations used in ElixirFM Resolve Online . . . . . . .

88

4.10 Syntactic Coloration of Basmalah – bayt-al-hikma.com . . . . . . . . .

88

4.11 Google Voice Search on Android . . . . . . . . . . . . . . . . . . . . . .

89

4.12 Annotations shown by Quranic Arabic Corpus website . . . . . . . . .

90

4.13 Divine names Highlight in Quran Reader iPhone application . . . . . .

91

4.14 Faceted Thematic Browsing - Qurany Project . . . . . . . . . . . . . .

93

4.15 Audience Background

. . . . . . . . . . . . . . . . . . . . . . . . . . .

98

4.16 Audience experience . . . . . . . . . . . . . . . . . . . . . . . . . . . .

99

‫خليفَة‬
َ

4.17 Clarity, Usefulness, and Need percentage of each feature . . . . . . . . 100
5.1

Basic Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.2

The behavior of Searcher . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.3

Text processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.4

Results processing phases . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.5

Different possible declensions of the word

5.6

Different types of the possible vocalizations of the word

5.7

Different writing forms of the word

5.8

Merging Words in Uthmani Script . . . . . . . . . . . . . . . . . . . . . 112

5.9

General Schema of Uthmani and Standard text processing. . . . . . . . 113

‫الملك‬

. . . . . . . . . . . . . 109

‫من‬

. . . . . . . 110

‫ بسطة‬using Othmani script

. . . . . 111

5.10 Example of Substitution . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.11 Tokenization of the word

‫فأَسق ْي َنٰكمو ُه‬
ُُ َْ َ

. . . . . . . . . . . . . . . . . . . . 116

5.12 Sub-tokens separation schema . . . . . . . . . . . . . . . . . . . . . . . 118
5.13 Example of tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.14 Arabic case markers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.15 Example of normalization . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.16 Example of stop-word filtering
5.17 Examples of lemmatization

. . . . . . . . . . . . . . . . . . . . . . 121

. . . . . . . . . . . . . . . . . . . . . . . . 122

5.18 Two-Steps search behavior . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.19 Semantically related words : Idols in Quran . . . . . . . . . . . . . . . 124
5.20 Word properties search example : First person, Plural, Masculine . . . 125
xvi

5.21 Searching through an ontology . . . . . . . . . . . . . . . . . . . . . . . 125
5.22 Semantically Related Words, Hyponymy of the word

‫( نبي‬prophet) .

. . 126

5.23 Multi-level Derivation Search example . . . . . . . . . . . . . . . . . . 127
5.24 Special derivations example, Imperative of

‫( قال‬to say)

. . . . . . . . . 128

6.1

Screenshot of the Qt desktop interface . . . . . . . . . . . . . . . . . . 136

6.2

Json results of

6.3

Fuzzy search example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

6.4

Showing adjacent ayahs . . . . . . . . . . . . . . . . . . . . . . . . . . 141

6.5

Showing ayahs in diﬀerent scripts . . . . . . . . . . . . . . . . . . . . . 141

6.6

Suggestion example of Vocalizations , Derivations ,and Synonyms of

6.7

Annotations of the keyword

6.8

Buckwalter translation example . . . . . . . . . . . . . . . . . . . . . . 142

6.9

Fields table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

‫الكوثر‬

. . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

‫. قيمة‬

‫قول‬

141

. . . . . . . . . . . . . . . . . . . . . 141

6.10 Translation-as-unit search , Query: seven

. . . . . . . . . . . . . . . . 145

6.11 Word-as-unit json outout, Query:

‫قيمة‬

6.12 Interfaces dependency hierarchy

. . . . . . . . . . . . . . . . . . . . . 150

. . . . . . . . . . . . . . . . . . 145

6.13 API usage sample code . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.14 Preview of the JSON web service . . . . . . . . . . . . . . . . . . . . . 153
6.15 Preview of the Console interface . . . . . . . . . . . . . . . . . . . . . . 153

xvii

List of Tables

1.1

Document Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

14

1.2

Forward Index

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

14

1.3

Inverted Index

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

15

1.4

N-gram index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

15

2.1

3 types of Arabic letters: 1 form, 2 forms or 4 forms . . . . . . . . . . .

26

2.2

The change of meaning by changing the diacritical marks . . . . . . . .

36

2.3

The change of function by changing diacritical marks . . . . . . . . . .

37

2.4

Ambiguities of preﬁxes . . . . . . . . . . . . . . . . . . . . . . . . . . .

38

2.5

Ambiguities due to suﬃxes . . . . . . . . . . . . . . . . . . . . . . . . .

38

3.1

Waqfs types [Web-Islamweb] . . . . . . . . . . . . . . . . . . . . . . . .

44

3.2

Some numerical miracles of Qur’an[Nawfal1975] . . . . . . . . . . . . .

50

3.3

Index based on word parts . . . . . . . . . . . . . . . . . . . . . . . . .

55

3.4

Index based on sentences– surah: al-fātiha . . . . . . . . . . . . . . . .

56

3.5

Overview of main index– Midād lbayān . . . . . . . . . . . . . . . . . .

58

3.6

Overview of words index – M.Taha Zerrouki . . . . . . . . . . . . . . .

59

3.7

Overview of the topics index– Taha Zerrouki . . . . . . . . . . . . . . .

60

3.8

Overview of the index of synonyms – Taha Zerrouki . . . . . . . . . . .

60

3.9

Overview of morphology index – Quranic Arabic corpus . . . . . . . . .

62

3.10 Example of simple proper Quranic text – Tanzil.info . . . . . . . . . .

63

3.11 Example of Surah index – Tanzil.info . . . . . . . . . . . . . . . . . . .

63

3.12 Sajdah index – Tanzil.info . . . . . . . . . . . . . . . . . . . . . . . . .

64

3.13 Example of rub’ index – Tanzil.info . . . . . . . . . . . . . . . . . . . .

64

3.14 Sample of Boundary Annotated Qur’an Corpus . . . . . . . . . . . . .

65

5.1

Partial vocalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

5.2

Some numbers as they mentioned in Quran

. . . . . . . . . . . . . . . 115

xviii

6.1

Search request ﬂags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

6.2

Pylint Analysis stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

6.3

Implementation State of search features

. . . . . . . . . . . . . . . . . 148

xix

List of Abbreviations
AGPL

Aﬀero General Public License.

API

Application Programming Interface.

GPL

GNU Public License.

GUI

Graphical User Interface.

IDF

Inverse Document Frequency.

OWL

Web Ontology Language.

PC

Personal Computer.

POS

Part Of Speech.

POS

Part Of Speech.

RSV

Retrieval Status Value.

TF

Term Frequency.

TF*TDF

term frequency - inverse document frequency.

UI

User Interface
.

xx

Glossary

Abrogated ayahs
Abrogating ayahs
Accusative
Active Participle
Active voice
Allegorical ayah
Assimilated verb
Attaching pronoun
Ayah
Basmalah
Book Science
Broken plural
Conjugation
Declension
Declinable
Defect
Demonstrative pronoun
Deverbal noun
Diacritical marks
Diphthong
Diptote
Dual form
Expansion

‫الآيات_المنسوخة‬
‫الآيات_الناسخة‬
‫حالة_النصب‬
‫اسم_الفاعل‬
‫صيغة_مبني_للمعلوم‬
‫الآيات_المتشابهات‬
ِ
‫فعل_م َثال‬
‫ضمير_م َّتصل‬
‫آية‬
‫بسملة‬
‫علم_الكتاب‬
‫جمع_تكسير‬
‫تصريف_الأفعال‬
‫الإ عراب‬
‫معرب‬
‫علّة‬
‫اسم_إشارة‬
‫مصدر‬
‫علامات_التشكيل‬
‫الإ دغام‬
‫ممنوع_من_الصرف‬
‫المثنّى‬
‫الزيادة‬
ّ
xxi

External feminine plural
External masculine plural
External plural
Fiqh
First ayahs of surah
First person
Fusional language
Geminated verb
General and Particular
Genitive
Hamzah
Hamzated verb
Healthy verb
Hizb
Hollow verb
Imperative
Imperfective
Instrument noun
Internal plural
Inversion
Jussive
Juz’
Lam-ALef
Last ayahs of surah
Laws
Lemma
Lexicology
Makkan
Medinan

‫جمع_مؤنث_السالم‬
‫جمع_مذكر_السالم‬
‫جمع_سالم‬
‫الفقه‬
‫فواتح_السورة‬
‫المتكلم‬
‫لغة_إشتقاقية‬
‫فعل_مضاعف‬
َ َ ُ
ّ‫الخاص_والعام‬
ّ
‫حالة_الجر‬
‫همزة‬
‫فعل_مهموز‬
َْ
‫فعل_صحيح‬
ِ
‫ح ْزب‬
‫فعل_أَج َوف‬
ْ
‫الأمر‬
‫المضارع‬
‫اسم_الآلة‬
‫جمع_تكسير‬
‫القلب‬
‫حالة_الجزم‬
‫ج ْزء‬
ُ
‫لام_ألف‬
‫خواتيم_السورة‬
‫الأحكام‬
‫جذع_الكلمة‬
‫علم_المعاجم‬
‫مكّ ّية‬
‫مدن ّية‬
xxii

Migration of Prophet
Morphology
Mus’haf
Narration of Hadiths
Nisf
Nomen Speciei
Nomen Vicis
Nominative
Noun of place
Noun of time
Object
Orthography
Othmani script
Passive Participle
Passive voice
People of the book
Perfective
Personal pronoun
Plural form
Primitive noun
Prophet’s Sunnah
Prosody
Prostration of recitation
Qiblah
Qur’anic comma
Qur’anic Parable
Recitation
Relative pronoun
Revelation Science
Rewayate
Rewayate of Kaloun

‫الهجرة_النبوية‬
‫الصرف‬
‫المصحف‬
ُ
‫رواية_الأحاديث‬
‫نِصف‬
ْ
‫اسم_الهيئة‬
‫اسم_المرة‬
َّ
‫حالة_الرفع‬
‫اسم_مكان‬
‫اسم_زمان‬
‫مفعول‬
ّ
‫علم_مرسوم_الخط‬
ّ
‫الخط_العثماني‬
‫اسم_المفعول‬
‫صيغة_مبني_للمجهول‬
‫أهل_الكتاب‬
‫الماضي‬
‫ضمير_منفصل‬
‫الجمع‬
‫اسم_جامد‬
‫الس َّنة_النبوية‬
ُ
‫تقطيع_العروض‬
‫سجود_التّلاوة‬
ِ
‫القبلة‬
‫فاصلة_قرآنية‬
‫م َثل_قرآني‬
َ
‫التلاوة‬
‫اسم_موصول‬
‫علم_التنزيل‬
‫رواية‬
‫رِواية_قالون‬
xxiii

Rhetoric
Root
Rubu’
Sajdah
Second person
Singular form
Standard script
Subject
Superlative noun
Surah
Surah keys
Tafssir
The Five Nouns
Third person
Thumn
Translation of Qur’an
Triptote
Verb with a simple root
Verb with augmented root
Virtues of surah
Waqf
Weakened verb

‫البلاغة‬
‫جذر_الكلمة‬
‫ر ُبع‬
ُ
‫سجدة‬
‫المخاطب‬
‫المفرد‬
‫الخط_الإ ملائي‬
‫فاعل‬
‫اسم_تفضيل‬
‫سورة‬
‫مفاتيح_السورة‬
ّ
‫علم_تفسير_القرآن‬
‫الأسماء_الخمسة‬
‫الغائب‬
‫ثُمن‬
ُ
‫ترجمة_معاني_القرآن‬
‫منصرف‬
‫فعل_مجرد‬
ّ
‫فعل_مزيد‬
‫فضائل_السورة‬
‫َوقف‬
ِ
‫فعل_نَاقص‬

xxiv


Work Context
Qur’an, in Arabic, means the read or the recitation. Muslim scholars define it as: the
words of Allah revealed to His Prophet Muhammad, written in Mus’haf and transmitted
by successive generations (‫[)التواتر‬Mahssin1973]. The Qur’an is also known by other
names such as: Al-Furkān , Al-kitāb , Al-dhikr , Al-wahy and Al-rōuh . It is the
sacred book of all Muslims and the first reference to Islamic law. It’s more then 14
centuries passed since its revelation, and the Muslims are still studying it, teaching it,
writing books about it and recently developing applications for it.
Qur’an is an important source of information that contains various information
about all aspects of life: Scientific, Social, Historic, Politic...etc.

Problematic
Due to the large amount of information held in the Qur’an, it has become extremely
difficult for regular search engines to successfully extract key information. For example,
When searching for a book related to English grammar, you’ll simply Google it, select
a PDF file and download it. That’s all! Search engines (like Google) are used generally
on Latin letters and for searching general information of document like content, title,
author…etc. However, searching through Qu’ranic text is a much more complicated;
It’s procedure that’s requiring a much more in depth solution as there is a lot of
information that needs to be extracted to fulfill Qur’an scholar’s needs. Before the
creation of computer, Qur’an scholars were using printed lexicons made manually. The
printed lexicons can’t help much since many search process waste the time and the
force of the searcher. Each lexicon is written to reply to a specific query which is
generally simple. Nowadays, there are many applications that are specific for search
needs; most of applications that were developed for Qur’an had the search feature but

1

in a simply way: sequential search with regular expressions.
The simple search using exact query does not offer better options and still inefficient
to move toward Thematic search by example. Full text search is the new approach
of search that replaced the sequential search and which is used in search engines.
Unfortunately, this approach is not applied yet on Qur’an. The question is why we
need this approach? Why search engines? Do applications of search in Qur’an really
need to be implemented as search engines?

Objectives
Our proposal is about design a retrieval system that fit the Qur’an search needs. But
to realize this objective, we must first list and classify all the search features that are
possible and helpful. Then we need to study how to implement each feature and what
is its requirements.

Report organization
We organized the report as follows:

First Part : Art State
This part contains 3 chapters:
Chapter 1 : Search Engines
To design a powerful search engine, it is essential to understand how search engines
work, in this chapter we discuss the different parts of a search engine, namely: the
crawling, indexing and querying . And the definition of basic concepts in the field of
information retrieval systems. This chapter contains an introduction to the semantic
approach.
Chapter 2 : Arabic Language
The objective of this chapter is to present the properties of the Arabic language,
its spells, its morphology and to introduce some ambiguity issues that raise due to the
Arabic nature ... etc.
Chapter 3 : The Qur’an
This chapter presents an overview of the Qur’an and its sciences, it has a historical
background on the evolution of the Qur’an, the structure of the Mus’haf, and the
main problems of computerization of the Qur’an, including the script Uthmani and
authentication Qu’ranic texts.

2


Second Part : Analysis & Conception
This part contains two chapters :
Chapter 4 : Qur’anic search features
The objective of this chapter is to present the possible search features in Qur’an.
It has a big importance in our work since it defines our objectives and our path on
the work. We’ve make a survey about Usefulness, Need, and Clarity of each feature in
order to validate our points of view in choosing those features.
Chapter 5 : Conception
In this chapter, we start by a preview on our previous work then we’ll propose many
improvements to carry out all the feasible search features mentioned in the previous
chapter.

Third Part : Implementation
This part contains the different steps of implementation of our retrieval system. It
includes one chapter:
Chapter 6 : Implementation
This chapter describes the choice of technologies and development tools and also
presents the prototype with a description of various features.
Finally, we finish the report with a conclusion that summarizes our work. We
include an appendix that describes the papers published about this work. Actually
there are two papers:
• An Arabic paper in NITS 2011 KSA entitled ”An Application Programming Interface for indexing and search in Noble Quran”1 [Chelli2011].
• An English paper in a pre-conference workshop in LREC 2012 Turkey which is
about ”LRE-Rel: Language Resource and Evaluation for Religious Texts”. The
paper was entitled ”Advanced Search in Quran: Classification and Proposition of
All Possible Features”[Chelli2012].

1

Arabic title:

‫مكتبة برمجية للفهرسة والبحث في القرآن الكريم‬
3

Chapter 1
Search engines

How could the world beat a path to your door when the path
was uncharted, uncatalogued, and could be discovered only
serendipitously?
Paul Gilster, Digital Literacy

1.1

Introduction

Our work falls within the field of Information Retrieval, as it aims to design a search
engine, in this chapter we will discuss how search engines work by explaining its main
components.
Exploration is the part that feeds the search engine by documents that it collects,
but with the amount of information that becomes larger and larger, it is necessary to
develop methods of search, only indexing able to accelerate search in very large systems
such as the Web, because it anticipates the search by extracting and arranging them
keywords.
So that search results be satisfactory, we must properly calculate the relevance of
results against the query, this is done during the interrogation. The question must also
be able to express simple questions as well as complex questions.
The quality of research is directly related to the quality of the crawling, indexing
and search, these three operations can be considered as the core of search engine, the
objective of this chapter is to define the main concepts of this area, starting with
defining the crawling, then study indexing, its methods and steps, and then we shall
explain the process of search and the notion of relevance.
5

Chapter 1

1.2

Definitions

1.2.1

Keyword

Word or set of words chosen to represent the contents of a document, and find it in
document search. It can be coming from the document (title, text, abstract, ...) or a
controlled vocabulary..[Hensens1998]

1.2.2

Descriptor

Keyword selected from a set of equivalent terms to represent clearly a concept. It is usually part of an organized and hierarchical vocabulary of type ”thesaurus”.[Hensens1998]

1.2.3

Document

A document can be text, a piece of text, web page, image, video, etc. We call Document
any unit that can be an answer to a user query. For textual documents, there are many
forms regarding their specification. A document can be a text without any structure
(it is also called full-text) and may also be a text with a structured part (document
partially structured or semi structured) or fully structured. [Amrouche2008]

1.2.4

Query

A query expresses the need for information of a user. Various types of query languages
have been proposed to formulate a query. A query can be expressed:
• In natural (or almost) language (eg: ”find all the manufacturing facilities of cars
and their addresses”)[Salton1971]
• In a structured format, also called Boolean query language (eg: ”cars and factories
and brand”)[Bourne1979]
• As graphical language from a GUI [Lelu1992]

1.2.5

Relevance

Relevance is a word that simply means returning the information considered the most
useful at the top of a result list. While the definition is simple, getting a program to
compute relevance is not a trivial task, mainly because the notion of usefulness is hard
for a machine to understand. [Bernard2009]

6

Chapter 1

1.3

Search engines

A search engine is software that allows to regain resources (web pages, images, video,
files, information ... etc.) related to any words. Some websites offer a search engine as
the main feature, called then search engine the website itself (Google, Yahoo, Bing ...
are search engines). [Nejjari2007]
A search engine is also a crawling tool on the web made up of ”robots” that explore
the websites periodically and automatically (without human intervention, that is what
distinguishes search engines from directories). They follow the links (pages that link
to each other) encountered on each page reached. Each identified page will be indexed
in a database, then will be accessible by Internet users using keywords.[Sanan2008]
Search engines do not apply only to Web: some engines are softwares installed on
personal computers. These are known as desktop search engines , they aims the search
in the files stored on the PC - include such Exalead Desktop, Google Desktop and
Copernic Desktop Search ... etc.
In December 2004, Tim Berners Lee (the inventor of the World Wide Web) talked
about a new project: ”Semantic Web” which is based on processing of the web information automatically according to their significances. The reason for this was that 80% of
Web contains texts intended to be read and understood by humans. While computer
programs, Web browsers and search engines are unable to understand this content, so
they are unable to speed up the search. In less than two years from the article of Lee,
the First foundations of Semantic Web were formed. They seemed to lead the world
toward a new revolution in Internet and search engines .[Abulhajjaj2009]
At first glance, nothing distinguishes a classic search engine from a semantic one.
The same sparse interface, with a text box in the center of the page where the user can
enter his search query. In fact, the difference lies in the search mode. A classic search
engine , as Google, works as follows: its robots index browse the pages and index the
words. Then store these words in a gigantic database. Users can do search by send
their queries and a search algorithm retrieve the results and sort them in a certain
order based on their relevance.[Mentre2008]

1.4

Full-text search

Full-text search is a technology focused on finding documents matching a set of words
.
While sounding like a mouthful, full-text search is more common than you might
think. You probably have been using full-text search today. Most of the web search

7

Chapter 1

engines such as Google and Yahoo! use full-text search engines at the heart of their
service. The differences between each of them are recipe secrets (and sometimes not
so secret), such as the Google PageRankTM algorithm. PageRankTM will modify the
importance of a given web page (result) depending on how many web pages are pointing
to it and how important each page is .
Be careful, though; these so-called web search engines are way more than the core
of full-text search: They have a web UI , they crawl the web to find new pages or
existing ones, and so on. They provide business-specific wrapping around the core of
a full- text search engine.
Given a set of words (the query), the main goal of full-text search is to provide
access to all the documents matching those words. Because sequentially scanning all
the documents to find the matching words is very inefficient, a full-text search engine
(its core) is split into two main operations: indexing the information into an efficient
format and searching the relevant information from this precomputed index. From the
definition, you can clearly see that the notion of word is at the heart of full-text search;
this is the atomic piece of information that the engine will manipulate. [Bernard2009]

1.5

Crawling

Crawling is the process by which we gather pages from the Web, in order to index
them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. the web crawler is sometimes referred to as a spider .
[Manning2009]
This process is in the phase preceding the indexing phase, see Figure:

8

Chapter 1

Figure 1.1: The various components of a web search engine

1.5.1

Crawler Features

We list the desiderata for web crawlers in two categories: features that web crawlers
must provide, followed by features they should provide. [Manning2009]
1.5.1.1 Features a crawler must provide
Robustness :

The Web contains servers that create spider traps, which are gener-

ators of web pages that mislead crawlers into getting stuck fetching an inﬁnite number
of pages in a particular domain. Crawlers must be designed to be resilient to such
traps. Not all such traps are malicious; some are the inadvertent side-eﬀect of faulty
website development .
Politeness :

Web servers have both implicit and explicit policies regulating the

rate at which a crawler can visit them. These politeness policies must be respected .
1.5.1.2 Features a crawler should provide
Distributed :

The crawler should have the ability to execute in a distributed

fashion across multiple machines .
Scalable :

The crawler architecture should permit scaling up the crawl rate by

adding extra machines and bandwidth .

9

Chapter 1

Performance and efficiency :

The crawl system should make efficient use of

various system resources including processor, storage and network bandwidth.
Quality :

Given that a significant fraction of all web pages are of poor utility for

serving user query needs, the crawler should be biased towards fetching “useful” pages
first.
Freshness :

In many applications, the crawler should operate in continuous mode:

it should obtain fresh copies of previously fetched pages. A search engine crawler,
for instance, can thus ensure that the search engine’s index contains a fairly current
representation of each indexed web page. For such continuous crawling, a crawler
should be able to crawl a page with a frequency that approximates the rate of change
of that page.
Extensible :

Crawlers should be designed to be extensible in many ways – to

cope with new data formats, new fetch protocols, and so on. This demands that the
crawler architecture be modular.

1.6

Indexing

To make the research cost acceptable, it should pass by an essential phase in the
document database. This phase consists in analyzing each document in the collection to create a set of keywords: we call it the indexing phase. These keywords will
be more easily used by the system during the subsequent process of search. Indexing create a representation of documents in the system. Its objective is to find the
most important concepts of the document (or query), which form the descriptor of
document.[Sauvagnat2005]

1.6.1

Definition

Indexing is the act of describing or classifying a document by index terms or other
symbols in order to indicate what the document is about, to summarize its content
or to increase its find-ability. In other words, it is about identifying and describing
the subject of documents. Indexes are constructed, separately, on three distinct levels:
terms in a document such as a book; objects in a collection such as a library; and
documents (such as books and articles) within a field of knowledge.
The process of indexing begins with any analysis of the subject of the document.
The indexer must then identify terms which appropriately identify the subject either
10

Chapter 1

by extracting words directly from the document or assigning words from a controlled
vocabulary. The terms in the index are then presented in a systematic order. Indexers
must decide how many terms to include and how specific the terms should be. Together
this gives a depth of indexing.[Lancaster2003]
Indexing is most often used to information retrieval. But it can also be used in other
areas such as automatic classification of documents, keyword suggestion, co-occurring
terms calculating, automatic summarization, etc.[Abar2009]

Figure 1.2: Indexing Benefits

1.6.2

Indexing modes

1.6.2.1 Manual indexing
Manual indexing is achieved by a human expert (librarian or specialist in the field )
that analyzes the content of the text to identify the terms representing the document.
Manual indexing ensures greater relevance in the answers, because it identifies a
more specific keywords describing a document.
However, it has several drawbacks, there is the problem of used vocabulary and
the dependence on indexer’s knowledge on the topic, ie the same document can be
indexed in several ways (according to vision of the person who makes the indexing),
and an indexer at two different times can have two distinct terms to represent the same
concept.
The major drawback of this method is the cost in time, this method is not therefore
appropriate when the number of documents to be indexed is substantial. [Sauvagnat2005,
Abar2009, Amrouche2008]
11

Chapter 1

Manual indexing is based on four key points [Chartron1989] :
• reading the entire document for preparation ;
• consideration of descriptors, objectives (applications) and user needs;
• permanent complementarity between the terms of manual indexing and abstract;
• in the absence of appropriate descriptor, and when the emergence of a new concept is not explicit enough to propose a candidate descriptor, the ability to use
a close or generic descriptor.
So we thought fast enough to use the computer[Mustafa].
1.6.2.2 Automatic indexing
Automatic indexing is a set of automated processing phases applied on documents. We
distinguish: Tokenization (automatic extraction of word), Elimination of stop words,
Stemming (Lemmatization or radicalization), Scoring of words and finally the creation
of the index[Sauvagnat2005].
The first approach to the automatic indexation KWIC (Key Word In Context-)
was introduced by Luhn (1957)[Luhn1957]. There was discussion about to weight the
index. In the early days of information retrieval, statistical methods were based on the
frequency of words in the document. Later, this measure was extended to take into
account the specificity of a term for the document. To this end, other methods have
been exploited, such as 2-Poisson (Nie, 2003)[Gaussier2003][Mustafa].
The automatic indexing systems use several methods of analysis:
1.6.2.2.1

Linguistic analysis:

Technology issued from ”text mining”, the latter

is to implement a simplified model of linguistic theories in computer systems of learning.
This is part of the artificial intelligence field . [Allab2008]
The linguistic method consists of several modules of linguistic analysis: morphological, lexical, syntactic and pragmatic. The fact that some systems use indexing
techniques of natural language processing, demonstrates the relevance of a linguistic
approach. [Elhachani1997]
1.6.2.2.2

Statistical analysis:

The initiator of the methods of the automatic

indexation is H.P. Luhn with his influential article “The automatic creation of literature abstracts” published in 1958 in the “Journal of Research and Development” of
IBM. He states : « (...) instead of sampling at ran- dom, as a reader normally does
when scanning, the new mechanical method selects those among all the sentences of
12

Chapter 1

an article that are the most representative of pertinent information », H. P. Luhn
opened the door to work on automatic indexing by proximity also called statistical
method[Luhn1958].
Automatic indexing involves the following steps :
• Extracting words (tokenization): the extraction rules are language-dependent.
• Eliminating stop words (stop words): these are words too frequent but unnecessary. Example: the, a, of, or ... etc.
• Stemming : for example the stem of the word ”stemmers” is ”stem”.
• Transformation rules: removal of plural endings.
• Truncation: choose an optimal value of truncation of words. It is better to
truncate suffixes. There is no absolute rule for this.
1.6.2.3 Semi-automatic indexing
The two previous techniques can be combined, a first automatic process to extract
the terms of the document. However the final choice remains the expert in the field
or librarian to establish semantic relations between keywords and choose the significant terms using a thesaurus or a terminology database which is an organized list of
descriptors (keywords) obeying specific terminology rules[Abar2009, Sauvagnat2005,
Hadjhenni2008].

1.6.3

Index types

The index is the output of the indexing process, there are several types of indexes
according to the used technique and the desired function:
1.6.3.1 Document Index
The document index keeps information about each document. It is an index ISAM
(Index sequential access mode) with a fixed width, ordered by the ID of the document.
The information stored in each entry includes data, a checksum of documents and
various statistics. If the document was crawled, it also contains a pointer to a variable
width file called the document information that contains the URL and title. This
design decision was driven by the desire to have a relatively compact data structure,
and the ability to find a record in one disk traversal, when queried.[Brin1998]
The following table is a simplified illustration of a document index:

13

Chapter 1

Document ID
Document 1
Document 2
Document 3

Text
The cow says moo
The cat and the hat
The dish ran away with the spoon

Link
/ex/doc1.txt
/ex/doc2.txt
/ex/doc3.txt

Table 1.1: Document Index

1.6.3.2 Forward Index
The forward index stores a list of words for each document. The following is a simplified
form of forward index:

Document ID
Document 1
Document 2
Document 3

Words
the, cow, says, moo
the, cat, and, the, hat
the, dish, ran, away, with, the, spoon
Table 1.2: Forward Index

The rationale behind developing a forward index is that as documents are parsing,
it is better to immediately store the words per document. The delineation enables
Asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it to an inverted index. The
forward index is essentially a list of pairs consisting of a document and a word, collated
by the document. Converting the forward index to an inverted index is only a matter
of sorting the pairs by the words. In this regard, the inverted index is a word-sorted
forward index. [Brin1998]
1.6.3.3 Inverted index
Many search engines include an inverted index when evaluating a search query to
quickly retrieve documents that contain words in the query and then sort them by
relevance. Since the inverted index stores the list of documents containing each word,
the search engine can use direct access to find documents associated with each word in
a query to retrieve documents that respond quickly. The following table is a simplified
illustration of an inverted index:

14

Chapter 1

Word
the
cow
says
moo

Documents
Document 1, Document 3, Document 4, Document 5
Document 2, Document 3, Document 4
Document 5
Document 7
Table 1.3: Inverted Index

This index can identify only if a word exists in a particular document because it
does not store any information regarding the frequency or the position of the word. It
is considered an index of boolean. This index determines which documents that match
a query, but does not classify them. In some models, the index includes additional
information such as frequency of each word in each document or positions of a word
in each document. The position information allow the search algorithm to identify the
adjacent words to support the search by phrases. The frequency can be used to assist
calculating the relevance of documents to the query. [Grossman2002, Tang2004]
1.6.3.4 N-gram index
An n-gram is a sequence of n consecutive characters. For any document, all n-grams
(usually n takes the values 2 or 3) we can generate, is the result obtained by shifting a
window of n squares on the body text. This shift occurs in steps, one step corresponds
to a character. Then we calculate the frequencies of n-grams found. for example

1

[Jalam2002]the french sentence ”La nourrice nourrit le nourrisson” is represented by :

n-grams
Frequencies

1
la_
1

2
a_n
1

3
_no
3

4
nou
3

5
our
3

6
urr
3

7
rri
3

8
ric
1

9
ice
1

10
_ce
1

11
e_n
2

12
rit
1

…
…
…

Table 1.4: N-gram index

One beneﬁt of n-grams is automatic tracking of the most common stems [Grefenstette1995]:
dans in the previous example, using techniques based on n-grams we ﬁnd the common
root of : Nourrir, nourri, nourrit, nourrissez, nourriture, etc. Tolerance to spelling
mistakes and distortions is also an important property. [Sanan2008]

1.6.4

Index storage

Storage of index structures is mainly characterized per the index size and organization
of its elements. Index structures vary widely in their use of size that is closely related
1

the character ”_” is used instead of spaces, in order to facilitate the reading.
15

Chapter 1

to the organization of data in the index.
This organization has a significant impact on latency of search. More items are
closely related to each other in the storage space is less latency research, this is called
the concept of locality. It is also very important that the index can hold in main
memory, it avoids disk access to the system and reduces the latency of search.
The ideal index is one that occupies less space and minimize search latency. [Dahak2006]

1.6.5

Index update

Updating the index refers to the behavior of applying changes on the index . Changes
can be insertions, modifications or deletions. An index can be more or less able to
adapt to these changes. This adaptation can occur in two forms:
1.6.5.1 Incremental update
In the case of an incremental update, the structure of the index is updated by adding
back the indexes of new documents without modifying existing ones. The number of
changes in this case is, however, often limited.[Dahak2006]
1.6.5.2 Global update
The third case, and worst is when the structure of the entire index must be rebuilt
from scratch.[Dahak2006]

1.6.6

Indexing phases

The indexing process consists of the following phases:
1.6.6.1 Tokenization
Tokenization is a phase that may seem trivial at first, and yet provide the basis for
the rest of the indexing phases. Therefore this phase must be done with the highest
quality.[Meylan2001]
Some retrieval systems use a list of predefined keywords. This list is designed
manually and, in most cases built for a specific topic. This method allow to control
the index size. The use of automatic extraction of keywords or the use of a list of
predefined keywords, determines the type of indexing. Document-oriented in the first
case and query-oriented in the second.[Berrut1997, Dahak2006]

16

Chapter 1

1.6.6.2 Normalization
This processing is to find for a word its normalized form (usually the masculine for
nouns, infinitive for verbs, the masculine singular for adjectives, etc.). Thus, in the
index are stored only in their normalized forms, which offers a significant size saving,
but more importantly, even if the processing is done on the request, it can be much
quicker and more flexible in research: for example, if a user searches with a verb,
documents that contains this verb in all its conjugated forms will be considered, not
just documents containing the word in the form provided by the user. This step is also
called ”morphological processing of keywords”[Denoyer2004]
This phase can also be enriched with syntactic and semantic processing of keywords.
The first is to identify and group a set of words whose meaning depends on their union.
For example, the words ”White House” does not usually mean you’re dealing with a
house that is white, but instead the seat of the presidency of the United States. It is
also to remove ambiguities such as the problems of homography.
Semantic processing is intended to make distinctions between different possible
meanings of a word (polysemy). For example, this phase helps differentiate the word
”room” that can match a coin, or a room in a house. This is an arduous task that
is not currently well controlled and its effect on system performance is not always
proven.[Dahak2006]
1.6.6.3 Elimination of stop-words
This phase is of some importance since it constitutes a factor of great influence in the
accuracy of the search. The failure to remove stop words inevitably cause noise. The
elimination of stop words which are words of everyday language and do not contain
much semantic information must be both in indexing as querying (removing stop-words
from the query). [Dahak2006]
1.6.6.4 Weighting
This step is entirely dependent on the model of information retrieval used. It defines
how important a term in a given document. [Dahak2006]
In general, most term weighting formulas are built by combination of two factors. A
local weighting factor measuring the local representativity of a term in the document,
and an overall weighting factor measuring the global representativity of a term with
respect to the collection of documents[Amrouche2008].
That leads to two types :

17

Chapter 1

1.6.6.4.1

Local weighting Local weighting takes into account the local informa-

tion of the term that depend only on the document. It is typically a function of
frequency of occurrence of the word in the document, denoted tf (Term Frequency).
A term that frequently appears in a document is considered relevant to describe its
contents. [Dahak2006]
1.6.6.4.2

Overall weighting The overall weight measures the importance of a

term within all documents. It aims to represent its discriminatory nature, or in other
words its ability to distinguish between document. In fact, a term appearing in few
documents is considered more discriminatory and should be favored over a term found
in many documents. The calculation of the overall weighting is based on the number
of documents in which a term appears. One of the most used is idf (Inverse Document
Frequency), represented by the following formula:
N
Idf = log( ni )

Such as ni is the number of documents containing the word i and N is the total
number of documents.
The value tf *idf gives a good approximation of the importance of a term in the
document, particularly in the corpus of documents of similar size. [Dahak2006]

1.7

Querying

Querying is the phase of interaction between the system and the user. This expresses
the need for information via a query language that the system will take care of interpreting. This interpretation is done according to the query template and is designed to
understand user needs and express them in a formalism similar to the one used when
indexing documents. This process provides an inner query. Following this phase of
query interpreting, a matching pattern calculates the match between the inner query
and each document in the index. This calculation established by the mapping function,
has traditionally resulted in an ordered list of documents. It should, at this level, a
semantic comparison (not equal) between concepts in of document and those of the
query.
The comparison between query and document rarely leads to strict equivalences,
but rather to partial equivalences: the document is only part of the query. The ﬁrst
document in the list returned by the system is one that is considered by the system
as the most relevant, that is to say the one that best suits the query, again according
to the system. The ﬁnal document is one that is considered by the system as the
18

Chapter 1

least relevant. This notion of relevance is based on the proximity between the needs
expressed by the user and the results provided by the system.[Dahak2006]

1.7.1

Relevance concept

Relevance is a central concept of the query because all evaluations are based around
this concept. But it is also the most poorly understood concept, despite numerous
studies on this concept as the one in[Denos1997].
Let us see some definitions of the relevance. Relevance is:
• The correspondence between a document and a query, a measure of informativeness of the document to the query;
• A degree of relationship (overlap, relativity, ...) between the document and the
query;
• A degree of surprise that comes with a document that is relevant to the needs of
the user;
• A measure of usefulness of the document to the user.
Even in these definitions, the used concepts (informativeness, relativity, surprise ...)
remains very vague because users have very different needs. They have very different
criteria for judging whether a document is relevant. So the notion of relevance is used
to cover a very wide range of criteria and relations[Dahak2006].

1.7.2

Similarity Function

Comparing between the document and query is equivalent to calculating a score, assumed to represent the relevance of the document in respect to the query. This value
is calculated from a function or a probability of similarity denoted rsv(q,d) (retrieval
status value), such as q is a query and d est a document and whose formula depends
entirely on the used model of information retrieval. This measure takes into account
the weight of terms in documents determined by statistical analysis and probability.
The matching function is very closely related to the operations of indexing and weighting of query terms and documents in the corpus. In general, the matching document
- query and indexing model used to characterize and identify a model of information
retrieval. The similarity function is then used to order the documents returned to
the user. The quality of this ordering is paramount. In fact, users is generally satisfied to examine the first documents (the top 10 or 20). If the documents sought
are not present in this slice, the user will consider sorting as bad in respect to his
query[Sauvagnat2005, Dahak2006].
19

Chapter 1

1.7.3

Search process

Search takes a user query and returns the effective list of matching results sorted by
relevance. Such as indexing, searching is a multiphase process, as shown in Figure.

Figure 1.3: Search process

The first operation is about building the query. Depending on the full text search,
the way to express query is either: :
1. String based—A text-based query language. Depending on the focus, such a
language can be as simple as handling words and as complex as having Boolean
operators, approximation operators, field restriction, and much more!
2. Programmatic API based—For advanced and tightly controlled queries a programmatic API is very neat. It gives the developer a flexible way to express
complex queries and decide how to expose the query flexibility to users.
Some tools will focus on the string-based query, some on the programmatic API, and
some on both.
The second operation, let’s call it analyzing, is responsible for taking sentences or
lists of words and applying the similar operation performed at indexing time (chunk
20

Chapter 1

into words, stems, or phonetic description). This is critical because the result of this
operation is the common language that indexing and searching use to talk to each other
and happens to be the one stored in the index. If the same set of operations is not
applied, the search won’t find the indexed words—not so useful!
Based on the common language between indexing and searching, the third operation
(finding documents) will read the index and retrieve the index information associated
with each matching word. Remember, for each word, the index could store the list
of matching documents, the frequency, the word positions in a document, and so on.
The implicit deal here is that the document itself is not loaded, and that’s one of the
reasons why full-text search is efficient: The document does not have to be loaded
to know whether it matches or not. The next operation (filtering and ordering) will
process the information retrieved from the index and build the list of documents (or
more precisely, handlers to documents). From the information available (matching
documents per word, word frequency, and word position), the search engine is able
to exclude documents from the matching list. More important, it is able to compute
a score for each document. The higher its score, the higher a document will be in the
result list. let’s have a look at some factors influencing its value :
• In a query involving multiple words, the closer they are in a document, the higher
the rank.
• In a query involving multiple words, the more are found in a single document,
the higher the rank.
• The higher the frequency of a matching word in a document, the higher the rank.

• The less approximate a word, the higher the rank.
Depending on how the query is expressed and how the product computes score, these
rules may or may not apply. This list is here to give you a feeling of what may affect
the score, therefore the relevance of a document. Once the ordered list of documents
is ready, the full-text search engine exposes the results to the user. It can be through
a programmatic API or through a web page. the following figure shows a result page
from the Google search engine.[Bernard2009]

21

Chapter 1

Figure 1.4: Search results returned as a web page: one of the possible ways to expose results

1.8

Semantic Approach

Semantic search seeks to improve search accuracy by understanding searcher intent and
the contextual meaning of terms as they appear in the searchable dataspace, whether
on the Web or within a closed system, to generate more relevant results. Semantic
search systems consider various points including context of search, location, intent,
variation of words, synonyms, generalized and specialized queries, concept matching
and natural language queries to provide relevant search results[Web-Techulator]. Major
web search engines like Google and Bing incorporate some elements of semantic search.
Rather than using ranking algorithms such as Google’s PageRank to predict relevancy, semantic search uses semantics, or the science of meaning in language, to
produce highly relevant search results. In most cases, the goal is to deliver the information queried by a user rather than have a user sort through a list of loosely
related keyword results. However, Google itself has subsequently also announced its
own Semantic Search project[Web-WSJ].
Other authors primarily regard semantic search as a set of techniques for retrieving
knowledge from richly structured data sources like ontologies. Such technologies enable
the formal articulation of domain knowledge at a high level of expressiveness and could
enable the user to specify his intent in more detail at query time[Web-ESWC2012].
Semantic search does not just mean contextual search or search based on the intend of the question. It include several other factors as well. A smart search engine
would consider several factors to provide the most relevant and useful search queries,

22

Chapter 1

including[Web-Techulator]:
• Current trend: If the president election was just finished in the country and
someone is searching for ’Who is the new president’, the semantic search system
should be able to understand the query and give relevant results based on the
current trend and news.
• Location of search: If a person is searching for ’what is the temperature’, the
semantic search engine should be able to provide results based on the current
location of the search. If the person is searching from California, search results
should include the current temperature in California.
• Intend of the search: Semantic search engines should be able to give appropriate search results based on the intent of the search and not based on the specific
words used in the search query.
• Variations of words in Semantic Search: Semantic search should consider
tenses, plural, singular etc and provide relevant search results for all semantic
variations of the words. For example, words like dog, dogs, dog’s etc.
• Synonyms and Semantic Search: A semantic search engine should be able
to understand the synonyms and give more or less the same search results on any
synonyms of the word users search for. For example, try searching for ”biggest
mountain” and ”highest mountain”. You would get pretty much the same results
since both of them means the same in this particular query, even though the
”biggest” and ”highest” could mean different things in different cases.
• Generalized and Specialized queries: Semantic Searching engine should be
able to set relation between generalized and specialized queries and provide appropriate and relevant results. For example, consider an article on general health
topics and another article specifically on Diabetes. If someone search for health
information, both articles could match even though the article on Diabetes does
not talk specifically about ”health”.
• Concept matching: This is a sub-set of context matching in semantic search.
Semantic search should understand the broad concept of the query and return
relevant results. For example, a query on ”Traffic problems in New Jersey” could
return relevant results including the topics ”narrow roads”, ”non functioning
traffic lights”, ”lack of roadside assistance” etc because in a broad conceptual
point of view, all of these lead to traffic problems.

23

Chapter 1

• Natural language queries: Not everyone are tech savvy and not many people
know what to search to get the relevant search results. Most users simply type
in queries in natural language. For example, if some one want to find what is
the current time in Arizona, USA, they would search for ’What time is it in
Arizona’. Most search engines would simply show results from the websites and
articles that talk about Time and Arizona. However, smart search engines that
use Semantic Search would actually show you the current time in Arizona, USA.
Try it yourself at Google search.
• Change of meaning based on the group of words: By combining different
words, the true meaning of search term could change. Consider the following
search terms:
◦ New egg health products
◦ New egg health benefits
If you search for both the above terms in Google, you would get completely
different meaning. Instead of just picking the results based on the words, Google
Search looks at it as a term and then combines with common user search pattern.
The first term returns search results primarily on the popular online shopping
website NewEgg.com and shows results of health products from that site and
similar sites. The second term shows search results for the health benefits of
Egg.
Semantic Search is a big challenge for search engines and none of them are perfect.
Most search engines have improved significantly in last few years. Search engines like
Bing and Google provide significantly relevant search results incorporating some degree
of semantic search. There are many other specialized search engines (like Hakia) which
offer purely semantic search results, but they lack many other qualities of normal search
engines.

1.9

Conclusion

In this chapter, the study focused on the working mechanism of search engines and
information retrieval systems, based on indexing due to its importance. Indeed, it is
the most important step in the search process as it allows the extraction and processing
of keywords.
The search phase does not offer only the interaction between users and the system,
but also calculates the match percentage between the query and the documents to
provide the most relevant results.
24

Chapter 2
Arabic Language

‫أَنا البحر في أَحشائ ِِه الدر كامـــن *** فهل سأَلوا الغواص عن صدَفاتـــي‬
َ َ َ َّ
ٌ ِ ُّ ُ
ُ َ
َ ََ
‫حافظ ابراهيم، قصيدة عن اللغة العربية‬

2.1

Introduction

Arabic (

‫ ) العربية‬is a name applied to the descendants of the classical Arabic language

of the sixth century AD, the most widely used in the Qur’an, the Islamic holy book.
Arabic is a Central Semitic language, closely related to modern Hebrew and Aramai
languages1 .
In this chapter, we will talk about orthography and morphology of the Arabic
language that are unique. we will also talk about some ambiguities that may appear
in Arabic due to the absence of vocalization.

2.2

Orthography

The Arabic script is one of the most used scripts all over the world. It dominates in
the Arab countries, of course, but holds a special place for all Muslims because it is
the script used to write the Qur’an.[Jabri]
It is written from right to left like other Semitic languages. Its alphabet has twentynine2 consonant letters, three of them .

‫يوا‬

are considered as vowels. Optionally,

1

Modern Aramaic, languages are varieties of Aramaic that are spoken vernaculars in the medieval
to modern era, evolving out of Middle Aramaic dialects around AD 1200.
2
The scientists of Arabic language considered the Hamzah as a letter
25

Chapter 2

one of the three Diacritical marks .

‫ـِ ُـ َـ‬

can be placed after certain characters to

resolve the ambiguity in pronunciation and/or direction when it arises. In a fully
vocalized Arabic text, the lack of diacritics can be regarded as a sokōune .

‫ْـ‬

(silence).

In some cases, a letter doubled, may be replaced by a single letter with tashdeed .
(reinforcement) placed above. [AlKharashi1999]

ّ‫ـ‬

In addition, it is important to note that the notion of uppercase and lowercase
letter does not exist, the Arabic writing is called unicameral. In addition, Arabic is
a language semi cursive, most letters are attached to each other, their spellings diﬀer
depending on whether they are preceded and/or followed by other letters or they are
isolated. Only six of them does not attach to the following letter : .
one letter does not attach at all : .

‫ء‬

‫وزرذدا‬

and

[Mesfar2008]

Letter
Hamzah
Wâw
Ayn

Spellings

‫ء‬
‫و ـو‬
‫ـع ـعـ عـ ع‬

Table 2.1: 3 types of Arabic letters: 1 form, 2 forms or 4 forms

2.3

Lexicography

The traditional Arabic grammar has only three subsets: Nouns, Verbs and Particles.

2.3.1

Verbs

A verb is an entity expressing a time-dependent sense. Most Arabic verbs are formed

‫ك َتب‬
َ َ
and eventually four consonants that is the case of the verb . ‫دحرج‬
َ َْ َ

on three radical consonants that is the case of the verb .

(kataba – write)
(dahrağa – roll

along). These roots may form several patterns as a result of one or more morphological
transformations (eg: repetition of a consonant, lengthening a vowel, the expanding of
a morpheme, etc.), it comes in this case to roots with augmented pattern.
Several linguistic studies have been conducted on the verbal system in Arabic,
see[Larcher2003]. In this section, it is necessary to introduce a classiﬁcation of verbs
according to their radicals:
2.3.1.1 Verbs with a simple root (‫المجرد‬
ّ

‫:)الفعل‬

A verb with a simple root has a base of three consonants called radical consonants.
These verbs are associated with verbal pattern .

‫فعل‬
َ ََ

(fa’ala). When none of the root

26

Chapter 2

consonants of the verb is a long vowel, it is called healthy. These radicals may involve

ِ
processing or causes of defects (‫ ,)علَّة‬we mention :
• The presence of .

ٔ‫ا‬

(á – hamzä), .

‫ي‬

(y - yâ’) or .

‫و‬

(w – wâw) among the

radical consonants. Depending on the position of that, we distinguish different
types of verbs :
◦ If one of the root consonants is .
= Hamzated verb (‫مهموز‬
َْ

ٔ‫ا‬

(á – hamzä), independently of its position

‫; )فعل‬

‫و‬

◦ The first radical consonant is a .
(‫مثال‬
َِ

‫;)فعل‬

◦ The second radical consonant is a .

‫;)أَج َوف‬
ْ

◦ The third radical consonant is a .

ِ
‫;)نَاقص‬

‫و‬

(w) or .

‫ي‬

(y) = Assimilated verb

(w) or .

‫ي‬

(y) = Hollow verb (‫فعل‬

‫و‬

(w) or .

‫ي‬

(y) = Weakened verb (‫فعل‬

• The presence of two identical consonants in the second and third position of the
root = Geminated verb (‫مضاعف‬
َ َ ُ

‫.)فعل‬

2.3.1.2 Verbs with augmented root (‫المزيد‬

‫:)الفعل‬

The patterns of verbs with augmented root are formed from simple roots by a set of
morphological operations to provide a specific meaning to the outcome verbs , we
mention:
• .
• .
• .
• .
• .
• .
• .
• .

‫فعل‬
َّ َ
‫فاعل‬
ََ َ
‫أَفعل‬
َ َْ

(fa’ala)
(faā’ala)
(áfa’ala)

‫( َتفعل‬tafa”ala)
َ َّ َ
‫( َتفَاعل‬tafaā’ala)
ََ
‫( ا ِْف َتعل‬ìfta’ala)
َ َ

‫( اِنْفعل‬ìnfa’ala)
َ ََ
‫( اِس َتفعل‬ìstaf’ala)
َ َْ ْ

27

Chapter 2

2.3.2

Nouns

The morphological system of Arabic nouns contains three subcategories:
2.3.2.1 Primitive nouns (‫الجامدة‬

‫:)الأسماء‬

The primitive nouns are nouns that can not be attached to a verbal root. They well
form the fundamental glossary of the concrete language. eg: .
.

‫كُ ْرسي‬
ّ ِ

(kursiyy – chair), .

َ
‫ك ْبش‬

‫أخ‬

(raás – head),

(kabš – Sheep), etc. In this category we also include

nouns composed of two letters such as: .
(áab – father), .

‫رأْس‬
َ

‫دم‬

(dam - blood), .

‫فم‬

(fam - mouth), .

‫أب‬

(áh – brother), etc.

ّ
2.3.2.2 Nouns derived from verbals (‫المشتقة‬

‫: )الأسماء‬

These are the nouns that can be derived from a verbal root. The number and nature of
these forms vary depending on the status of the verb to which they relate. As nouns,
they can receive marks of case, gender and indeterminacy.
2.3.2.3 Numbers :

‫صفر‬
‫عشرون‬

This category of nouns is made up of simple numerals representing units: from .
(sifr- zero, 0) to .

‫تسعة‬

(tisat – nine, 9); the tens: .

(išruwn – twenty, 20) … and .

‫تسعون‬

‫تسعة_عشَ ر‬
َ

(ašarat – ten, 10), .

(tisuwn – ninety, 90) ; the hundreds, etc. well

as numerals compounds such as cardinals of .
to .

‫عشرة‬

(tisat ašara - nineteen, 19).

‫أَحد_عشَ ر‬
َ َ

(áahada ašara - eleven, 11)

In their decomposition, the Arab grammarians have classified adjectives to nouns
as they almost take all the morphological forms and may, for example, be definite or
indefinite and flex according to case, number and type.
2.3.2.4 Demonstrative pronouns (‫الإ شارة‬

‫:)أسماء‬

Demonstrative pronouns represent a subcategory of noun expressing an idea of demonstration. They can indicate that the object represented is found, either in the text,
either in space or time, defined by the situation of utterance. They are two subsets:
near-deictic (eg: .
.

َ‫ذلِك‬

‫هذا‬
َ

(dalika – that), .

(hadaā – this), .

َ‫أُولائِك‬

‫هؤلاء‬
َُ

(haẃulaā’ – these) and far-deictic (eg:

(ùuwlaāýika – those), etc.). Demonstratives are deriv-

able only to dual.

28

Chapter 2

2.3.2.5 Relative pronouns (‫موصولة‬

‫:) ٔاسماء‬

Relative pronouns relate to the noun or personal pronoun that precedes them and that
we denote by antecedent. The relatives shall aﬀord with their antecedents but are
derivable only to dual (as demonstratives). Among the relative pronouns, we mention:
.

‫الَّذي‬

dual), .

(al-ladiy - that, masculine, singular), .

‫اللذين‬
َ

(those, masculine, plural), etc.

2.3.2.6 Personal pronouns (

ِ‫الل َت ْين‬

(al-latayni – those, feminine,

‫:) الضمائر المنفصلة‬

Personal pronouns are intended to identify three types of grammatical persons:
• First person, ie, the speaker (‫ , )المتكلم‬that who is talking: .
.

‫نَحن‬
ُ ْ

‫أَنَا‬

(ánaā - I) or

(nahnu – we) ;

‫( أَنْت‬áanta –
َ
ِ ْ‫( أَن‬áanti – you, feminine, singular), . ‫( أَنْ ُتما‬áanyou, masculine, singular), . ‫ت‬
َ
tumaā – you, dual), . ‫( أَنْتم‬áantum – you, masculine, plural), . ‫( أَنْتن‬áantunna
ُْ
َّ ُ

• Second person, ie, the listener (‫ ,)المخاطب‬that who talking to: .

– you, feminine, plural);

• Third person, ie, the absent (‫ , )الغائب‬that who talking about: .
.
.

2.3.3

ِ
‫هي‬
َ
‫هن‬
ُ

(hiya – she), .

‫هما‬
َُ

(humaā – they, dual), .

(hunna – they, feminine).

‫هم‬
ُْ

‫ه َو‬
ُ

(huwa – he),

(hum – they, masculine),

Function words

The function words are used to locate entities, facts or objects in relation to time or
place. They also play a key role in the coherence and sequencing of a text. For example,
we have particles that designate a time:
• .

‫بعد‬

(bada – after)

• .

‫قبل‬

(qabla – before)

• .

‫منذ‬

(mundu – since)

or a place like
• .

‫حيث‬

(haytu – where),

According to their semantic meaning and their function in the sentence, they can
play an important role in the interpretation of a sentence expressing an introduction,
explanation, consequence, etc.[Kadri1992]. Function words include various categories,
we mention:
29

Chapter 2

• Prepositions : .

ِ
‫في‬

(fiy – in) or .

• Conjunctions: .

‫ثُم‬
َّ

(tumma – then) ;

• Adverbs : .

‫أَ َبدًا‬

‫ع َلى‬
َ

(abadã – never) or .

in the normal way) ;
• Quantifiers: .

‫كل‬
ُّ

(kulla – all ) or .

(alaą – on);

ِ َ ْ
‫بِشكل_عادي‬

‫َبعض‬
ْ

(bišaklĩ aādiyyĩ – normally,

(bada – some) ;

• Etc.
The function words are divided into subgroups: those variables (quantifiers) and those
that are invariable (adverbs, prepositions, etc.).

2.4

Morphology

There are several categories of Fusional languages, and Arabic is precisely in the
category of languages with Intro-flexion: this category of languages, the consonants
indicate the meaning and vowels mark the flexion of word. This system is found
especially in the Semitic languages (eg: Arabic, Hebrew) [Choueiter2006]
Morphologically, the Arabic language is very rich and based on the structure of
patterns and roots. Most Arabic words are generated from a finite set of roots (about
7000 roots) transformed using one or more patterns (about 400-500). Theoretically, a
single Arabic root can generate hundreds of words (noun, verb, ...). An Arabic word
can exist in about a hundred of forms in a normal text by adding certain suffixes and
prefixes (mainly considered as stop-words in English).[AlKharashi1999]

2.4.1

Flexional Morphology

Arabic uses for the declension of verbs and nouns, some indications of aspect, mood,
time, person, gender, number and case, which are generally suffixes and prefixes[Gaudefroy1975].
Generally, these flexional marks can distinguish [?] :
• Mode of verbs: eg, for the verb .

‫ذهب‬
َ َ

(dahaba – to go), forms in the Perfective

‫( ذه ْبت‬dahabatu – I went) or
ُ َ
their prefixes such in the Imperfective (‫ )المضارع‬as . ‫( أَذهب‬áadhabu – I go) ;
ُ َ
ِ ُ َ
Function of nouns: using of suffixes such as . ‫( رجلان‬rağulaáni – Two men) in
Nominative (‫ )حالة الرفع‬or . ِ‫( رجلين‬rağulayni – Two men) in Accusative (‫حالة‬
َْ ُ َ
‫ )النصب‬or Genitive (‫]?[ .)حالة الجر‬
(‫ )الماضي‬can be identified using their suffixes as .

•

30

Chapter 2

2.4.1.1 Flexion of verbs
Called also Conjugation, it describes the variation in their forms according to circumstances. Generally, conjugation includes a number of values which are:
• Aspect: The aspect is a grammar feature associated, in most cases, to verbs in
order to indicate which state it expresses; considered from the perspective of its
development (beginning, progress, completion, overall evolution, etc.), regardless
of when it comes ;
• Mood : Mood indicates how the action expressed by the verb is designed and
presented. The action can be doubted, affirmed as actual or eventual. They
combine the semantics of verbs and thereby create aspects ;
• Tense : Tense is a grammatical feature to locate a fact (which may be a state or
action) in the enunciation time axis relative to the three markers: past, present
and future. The temporal indications are often accompanied by aspectual indications that are more or less related.
These three key values are closely related ; they can describe two basic forms of the
verb in Arabic :
• Perfective (‫ :)الماضي‬it indicates that the progress of the action expressed by
the verb is finished, which means the past. It is characterized by adding suffixes
of person, gender, number and mood to the verb’s stem. For example, for the
feminine plural of the verb .
get the form .

‫ك َت ْبن‬
َ َ

‫ك َتب‬
َ َ

(kataba – to write), we add the suffix .

‫ن‬
َ

to

(katabna - *they* wrote, feminine) and for the masculine

plural, we add the suffix .

‫وا‬

masculine) ;

to get the form .

َ
‫ك َت ُبوا‬

(katabuwá - *they* wrote,

• Imperfective (‫ :)المضارع‬it indicates an unfinished progress, which may imply
the present. It is characterized by adding a prefix and one or more infixes as a
letter duplication or a vowel substitution. For example, for the verb .
– to give), we can get .

ُّ ُ
‫أَمد‬

(áamuddu – I give) or .

‫َيمددن‬
ُْ ْ

َّ َ
‫مد‬

(madda

(yamdudna – they

gives, feminine). It includes two types of modal inflections:

◦ The indicative of actual mode where the speaker states the actual character
(reread, to be achieved, in progress, etc.) of action or state expressed by the
verb;
◦ The subjunctive of potential mode in which the speaker merely states the
possible or virtual nature of action or state expressed by the verb.
31

Chapter 2

• Imperative (‫ :)الأمر‬it expresses the order, command, or exhortation ... etc. It
exists only the with the 2nd person in singular, dual and plural;
2.4.1.2 Flexion of nouns
In Arabic, the declension (‫ )إعراب‬of nouns involves three cases: Nominative (‫,)مرفُوع‬
َْ
Accusative (‫ )منصوب‬and Genitive (‫ .)مجرور‬Except for some special cases, the nouns are
ْ
ُ ْ
declinables (‫ )معربة‬and appear in one of these three cases according to their functions
in the sentence. In terms of the spelling, the case represents only an assistant graphic
at the end of nominal forms. The nominal system of Arabic admits different systems
depending on the nature of variation of the form (triptote, diptote, etc.) and the number
thereof (singular, dual or plural). We can distinguish :
2.4.1.2.1 Declension of singular nouns:
Basic declension of triptotes (‫ :)منصرف‬This is the most frequent case, it takes
the vowel .

‫ضمة‬
َّ َ

(dammat – u) as a sign of the nominative , the vowel .

a) in the accusative and the vowel .

‫كَس َرة‬
ْ

.

‫ة‬

ٍ‫ـ‬

(ta) or by .

(fathat –

(kasrat – i) in the genitive. When the noun is

undefined, the tanwîn is marked respectively by the three diacritics: .
(ã - an) et .

‫ف ْتحة‬
َ َ

ٌ‫ـ‬

(ũ – un), .

‫ًـ‬

(ĩ – in). In the indefinite accusative , except the case of nouns ending by

‫ا‬

(ā’), an álif .

‫ا‬

(ā) strengthens the tanwîn .

the accusative indefinite the noun .

ِ
‫ك َتاب‬

(an) : for example, in

(kitaāb – book) becomes .

book, accusative, indefinite) and the book .
(ğaziyratã – island, accusative, indefinite).
Declension of diptotes(‫الصرف‬

‫ًـ‬

‫جزِيرة‬
َ َ

ِ
‫ك َتا ًبا‬

(kitaābãā –

(ğaziyrat – island) becomes .

‫:)الممنوع من‬

‫جزِيرة‬
ًَ َ

The nouns that are diptotes, gram-

matically undefined, do not accept tanwîn and take the same mark in the accusative
and the genitive which is the .

‫ف ْتحة‬
َ َ

(fathat – a). By contrary, when they are defined,

they follow the declension of triptotes. This is the case of feminine nouns that end

‫( اء‬ā’) such as . ‫( صح َراء‬sahraā’ – desert), masculine adjectives of colors with
ْ
ْ
the pattern . ‫( أَفعل‬afal) such as . ‫( أحمر‬áhmar – red) and those which are feminine
َ ْ
with the pattern . ‫( فعلَاء‬falaā’) such as . ‫( بيضاء‬baydaā’ – white , feminine)
َ َْ
َْ
with

Declension of The Five Nouns (‫الخمسة‬
• Three nouns: .

‫ٔابو‬

(áabuw - father), .

‫ٔاخو‬

‫:) الأسماء‬

The five nouns are :

(áhuw - brother) and .

‫حمو‬

(hamuw

- stepfather) ;
• A variant of .

َ
‫فم‬

(fam – mouth) : .

َ
‫فا‬

,.

‫فو‬

and .

ِ
‫في‬

;

32

Proposal of an Advanced Retrieval System for Noble Qur’an

Proposal of an Advanced Retrieval System for Noble Qur’an

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Proposal of an Advanced Retrieval System for Noble Qur’an

Similar to Proposal of an Advanced Retrieval System for Noble Qur’an (20)

More from Assem CHELLI

More from Assem CHELLI (8)

Recently uploaded

Recently uploaded (20)

Proposal of an Advanced Retrieval System for Noble Qur’an