University Of Sfax
Faculty of Economics and Management of Sfax
Multimedia, InfoRmation Systems and Advanced Computing Laboratory
M A S T E R T H E S I S
to obtain the title of
Master Degree in Computer Systems and new Technology
Defended by
Imad Eddin Jerbi
Construction and Morpho-syntactic
Annotation of a Colloquial Corpus:
Case of Tunisian Arabic
Supervisor: Mariem Ellouze
Co-supervisor: Inès Zribi - Rahma Boujelbane
Defended on 27th December 2013
Jury:
President : Lamia Hadrich Belguith Professor (FSEGS)
Reviewer : Maher Jaoua Associate Professor (FSEGS)
Advisor : Mariem Ellouze Associate Professor (ESCS)
Invited : Inès Zribi University of Provence
Rahma Boujelbane University of Sfax
Acknowledgments
I would like first to express my thanks to the head of ANLP-RG
Mrs. Lamia Hadrich Belguith for accepting me in the research group.
Above all, I would like to express my deepest appreciation to my supervisor
Mrs. Mariem Ellouze Khemakhem and the co-supervisor Miss. Inès Zribi and
Miss. Rahma Boujelbane - you have been a constant source of encouragement and
guidance, and your faith in me is largely responsible
for not only completing this thesis but also enjoying working on it.
I would like to thank the jury members:
Mr. Maher Jaoua, Mrs. Lamia Hadrich Belguith,
Mrs. Mariem Ellouze, Miss. Inès Zribi and Miss. Rahma Boujelbane
for their precious time reading my thesis and for their constructive comments.
I must not forget to thank my professors who generously shared their expertise.
Also, I especially thank the master department director Mr. Mahmoud Naji.
I also would like to thank my family and all my friends especially Hakim Mkacher and
Hamdi Zroud for their support and help.
Thank you ALL!
Contents
Acknowledgments i
List of figures v
List of tables vii
List of Abbreviations xi
Introduction 1
I Related Work 3
1 Linguistic Resources 5
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Speech and Text Data Collection . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Arabic Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Modern Standard Arabic Corpora . . . . . . . . . . . . . . . . . . 6
1.3.2 Dialectal Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Orthographic Transcription . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Transcription Software . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.2 Transcription Guidelines . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Morpho-Syntactic Annotation 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Morpho-syntactic Annotation Methods for MSA Language . . . . . . . . 13
2.3 Morpho-syntactic Annotation Methods for Dialectal Arabic . . . . . . . . 16
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
II Proposed Method 19
3 Data Collection and Transcription 21
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Speech Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Transcription Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Transcription Tools . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.2 Transcribing Guidelines . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4 Morpho-syntactic Annotation Method 33
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Our main Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Preliminary Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Word analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5 Choosing results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.6 Result file generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5 Realization and Performance Evaluation 45
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Tools and Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.1 Al-Khalil analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.2 MADA analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.3 Al-Khalil TD version . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.4 Tunisian Dialect Dictionary . . . . . . . . . . . . . . . . . . . . . 46
5.3 Tunisian Dialect Annotation Tool . . . . . . . . . . . . . . . . . . . . . . 46
5.3.1 Process of TDAT . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.2 Realization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.3 TDAT Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4.1 Gold standard for the TD language . . . . . . . . . . . . . . . . . 58
5.4.2 Evaluate the TDAT . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Conclusion 63
A TD Enriched Orthographic Transcription 65
A.1 Inter-pausal units segmentation . . . . . . . . . . . . . . . . . . . . . . . 65
A.2 Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.2.1 Typographic rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.2.2 Pronunciation notation . . . . . . . . . . . . . . . . . . . . . . . . 67
A.2.3 Liaisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.2.4 Non Arabic phonemes . . . . . . . . . . . . . . . . . . . . . . . . 68
A.2.5 Reported speech . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.2.6 Incomprehensible sequences . . . . . . . . . . . . . . . . . . . . . 68
A.2.7 Laughers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.2.8 Pauses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
B LCD Commitment 71
Bibliography 73
List of Figures
3.1 The proportion of themes in the corpus . . . . . . . . . . . . . . . . . . . 22
3.2 Sections and Turns in Transcriber . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Manage speakers in Transcriber . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Manage speakers in Praat . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Morpho-syntactic annotation steps in our method . . . . . . . . . . . . . 34
4.2 Example of the Segmented Text . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Analyzing process of a TD word . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 Analyzing a word with TD dictionary . . . . . . . . . . . . . . . . . . . . 40
5.1 XML Schema of the Annotated Text . . . . . . . . . . . . . . . . . . 48
5.2 Main Interface in the TDAT . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Transcription window in the TDAT . . . . . . . . . . . . . . . . . . . . . 53
5.4 Segmented Text window in the TDAT . . . . . . . . . . . . . . . . . . . 54
5.5 Add an analysis result window in the TDAT . . . . . . . . . . . . . . . . 54
5.6 Analyze window in the TDAT . . . . . . . . . . . . . . . . . . . . . . . . 55
5.7 Analyse Details window in the TDAT . . . . . . . . . . . . . . . . . . . 56
5.8 Annotation options window in the TDAT . . . . . . . . . . . . . . . . . 57
5.9 Analysis Result File window in the TDAT . . . . . . . . . . . . . . . . . 58
List of Tables
3.1 Corpus files content . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Description of tags used in the references XML file . . . . . . . . . . . . 24
3.3 Clitics in the TD language . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Description of tags used in the segmented text file . . . . . . . . . . . . . 35
4.4 Description of tags used in the annotation result file . . . . . . . . . . . . 42
4.1 Annotations extracted from the transcription files . . . . . . . . . . . . . 43
4.3 Levenshtein distance table example . . . . . . . . . . . . . . . . . . . . . 44
5.1 Description of icons used in the annotation interface . . . . . . . . . . . . 54
5.2 Evaluation results of the word segmenter module . . . . . . . . . . . . . 59
5.3 Evaluation results of the TD dictionary module using the Levenshtein
distance function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4 Evaluation results of the TDAT . . . . . . . . . . . . . . . . . . . . . . . 61
List of Abbreviations
AMADAT Arabic Multi-Dialectal Transcription Tool
AOC Arabic Online Commentary Dataset
CES Corpus Encoding Standard
DTD Document Type Definitions
ECA CALLHOME Egyptian Arabic Speech
ELAN EUDICO Linguistic Annotator
FBIS Foreign Broadcast Information Service
HMM Hidden Markov Model
HPL Hewlett-Packard Laboratories
ICA Iraqi Colloquial Arabic
LA Levantine Arabic
LATB Levantine Arabic TreeBank
LDC Linguistic Data Consortium
MSA Modern Standard Arabic
NLP Natural Language Processing
OSAC Open Source Arabic Corpora (Updated)
OSAc Open Source Arabic Corpus
POS Part of Speech
POST Part of Speech Tagging
SAAVB Saudi Accented Arabic Voice Bank
SAMPA Speech Assessment Methods Phonetic Alphabet
STT Speech-to-Text
TD Tunisian dialects
TDAT Tunisian Dialect Annotation Tool
TEI Text Encoding Initiative
XML Extensible Markup Language
XSL eXtensible Stylesheet Language
Introduction
The Arabic language is spoken by about 300 million people [Al-Shamsi 2006] and is
the fourth most spoken language; it is thus a major international modern language.
Considering the number of people who speak it, computing resources for the Arabic
language are still few. The Arabic language is a blend of Modern Standard Arabic, used
in written and formal spoken discourse, and a collection of related Arabic dialects. This
mixture was defined by Hymes [Hymes 1973] as a linguistic continuum. Indeed, Arabic
dialects present significant phonological, morphological, lexical, and syntactic differences
among themselves and when compared to the standard written forms.
Furthermore, the presence of diglossia [Ferguson 1959] is a real challenge for Arabic
speech language technologies, including corpus creation to support Speech-to-Text (STT)
systems. Other difficulties for researchers stem from the morphological complexity of the
Arabic dialects. In addition, only small amounts of text data exist for spoken Arabic,
because these varieties have no official written form.
Better corpora will, in the first place, support further research in the area, for example
helping linguists analyze Arabic dialect phenomena. They also lay the ground for creating
new and better end-user applications.
One of the fundamental prerequisites of any Natural Language Processing (NLP)
application in a specific language, such as the Tunisian Dialect (TD), is the existence of
corpora. Indeed, the construction of speech corpora for the TD is fundamental for study-
ing its specific characteristics and for advancing its NLP applications, for example speech
recognition. A few small corpora developed in previous research exist today. However,
these corpora mix several dialects, are not specific to the TD, and do not include any
diacritic information. In general, they belong to closed projects or are not freely available.
In addition, they include neither morpho-syntactic annotation nor phonetic information.
The aim of this project is to investigate how to collect and transcribe speech data,
the possibility of using existing transcription tools, the choice of appropriate guidelines,
how to annotate the transcripts, and which methods to use.
The report is divided into two parts. The first part presents the state of the art of
existing speech corpus resources for the Arabic language; its second chapter lists some
morpho-syntactic annotation methods. The second part describes the method and the
resources used to collect, transcribe, and annotate speech data. The third chapter presents
the steps that we followed to collect and transcribe speech data. In the fourth chapter,
we present our method for achieving the morpho-syntactic annotation task. The last
chapter presents the tools and resources used, the developed tool, and the obtained
results. Finally, a conclusion summarizing the results of our work and presenting some
future prospects is given at the end of this report.
Part I
Related Work
Chapter 1
Linguistic Resources
1.1 Introduction
A spoken language corpus is defined as a collection of speech recordings which is ac-
cessible in computer readable form and which comes with annotation and documentation
sufficient to allow re-use of the data in-house, or by scientists in other organizations
[Gibbon 1997]. Indeed, creating speech corpora is crucial for studying the different
characteristics of spoken language, as well as for developing applications that deal with
voice, for example speech recognition systems.
In this chapter, we answer the following question: how is a speech corpus created?
We first present methods of speech and text data collection in the next section. Then,
we focus on some available corpora for the Arabic language. After that, we introduce the
orthographic transcription task with a literature recap of its guidelines and tools.
1.2 Speech and Text Data Collection
A prerequisite for the successful development of spoken language resources is a good
definition of the speech data to be collected. There are three steps in text data collection.
The first is to specify the source of the data (books, novels, chat rooms, etc.), its type
(standard written language or dialectal), theme (social, news, sport, etc.), encoding and
format (transcribed files, web pages, XML files, etc.). The second step, data collection,
is performed using different techniques such as harvesting large amounts of data from
the web [Diab 2010] and transcribing speech data [Messaoudi 2004]; automatic speech
recognition, as used in [Messaoudi 2004, Gauvain 2000], can also be applied to extract
text from speech data. The third step consists in adapting and organizing these data
[Diab 2010].
Speech data collection follows the same steps as text collection. In addition, speech
data, with their different types (audio or video) and formats (mp3, wave, avi, etc.), are
collected in different ways. The easiest is to download streaming video and audio from
the Internet. Unfortunately, with this method we cannot guarantee good data quality.
Otherwise, we have to resort to recording, where we can fix the subjects and the speakers'
dialects as we wish and, at the same time, ensure better data quality. However, recording
requires funding to pay speakers and to buy specific equipment.
Some annotation tools [Kipp 2011] give direct access to broadcast data through the
associated Uniform Resource Locator (URL), which removes the collection step. Regret-
tably, this feature is currently not available in the voice annotation tools we used.
1.3 Arabic Corpora
The Arabic language is composed of a standard written language (Modern Standard
Arabic) and a collection of spoken dialects. The Arabic dialects are used extensively in
almost all everyday conversations and therefore have considerable importance. However,
owing to the lack of data and the poverty of resources, Natural Language Processing
(NLP) technology for dialectal Arabic is still in its infancy. Basic resources, tokenizers,
and morphological analyzers, which have been developed for Modern Standard Arabic
(MSA), are still virtually non-existent for the dialects.
1.3.1 Modern Standard Arabic Corpora
There are many research projects involved in the development of MSA corpora, such as
the updated version of the Open Source Arabic Corpora (OSAC) described in [Saad 2010],
which includes the following corpora: the British Broadcasting Corporation (BBC) Arabic
corpus collected from bbcarabic.com, the Cable News Network (CNN) Arabic corpus
collected from cnnarabic.com, and the Open Source Arabic Corpus (OSAc) collected from
multiple sites. The OSAC corpus contains about 23 MB of text after removing the stop
words.
The Foreign Broadcast Information Service (FBIS) corpus is another MSA corpus, cre-
ated by [Messaoudi 2005] and used in [Vergyri 2004]. The data set comprises a collection
of radio newscasts from various radio stations in the Arabic-speaking world (Cairo, Dam-
ascus, Baghdad), totalling approximately 40 hours of speech and roughly 240K words.
The transcription of the FBIS corpus was done in Arabic script only and does not contain
any diacritic information.
The Linguistic Data Consortium (LDC) provided the Penn Arabic Treebank
[Duh 2005], a data set of newswire text from Agence France Presse, An-Nahar News,
and Ummah Press transcribed in standard MSA script. It contains more than 113,500
tokens, which are analyzed and provided with disambiguated morphological information.
1.3.2 Dialectal Corpora
At present, the major standard dialect corpora are available through the LDC via the
DARPA EARS (Effective, Affordable, Reusable Speech-to-Text) program, which develops
robust speech recognition technology to address a range of languages and speaking styles
and which includes data from the Egyptian, Levantine, Gulf, and Iraqi dialects. The LDC
also provides conversational and broadcast speech with their transcripts.
Levantine Arabic (LA) is represented by the Levantine Arabic QT Training Data
Set [Maamouri 2006a], a set of phone conversations of Levantine Arabic speakers
[Duh 2005]. The data set contains approximately 250 hours of telephone conversations.
About 2,000 successful calls have been collected, distributed in terms of regional dialect
(Levantine, Egyptian, Gulf, Iraqi, Moroccan, Saudi, Yemeni). The data set provides
both the conversational speech and its transcripts.
Moreover, another Arabic colloquial corpus, CALLHOME Egyptian Arabic Speech
(ECA), used in [Duh 2005, Gibbon 1983], is dedicated to the Egyptian dialect. The
data set consists of 120 telephone conversations between native speakers of the Egyptian
dialect. The ECA corpus contains both dialectal and MSA word forms. ECA is also
accompanied by a lexicon containing the morphological analysis of all words, i.e. an
analysis in terms of stem and morphological characteristics such as person, number,
gender, POS, etc.
Last but not least, the Saudi Arabic dialect is represented by the Saudi Accented
Arabic Voice Bank (SAAVB) [Alghamdi 2008], which is very rich in terms of its speech
sound content and speaker diversity within Saudi Arabia. The total duration of the
recorded speech is 96.37 hours, distributed among 60,947 audio files. SAAVB was
externally validated and used by the IBM Egypt branch to train their speech recognition
engine.
The English-Iraqi corpus is another Arabic corpus, mentioned in [Precoda 2007], which
consists of 40 hours of transcribed speech audio from DARPA's Transtac program.
Many small corpora were developed to satisfy specific needs, as described
in [Al-Onaizan 1999, Ghazali 2002, Barkat-Defradas 2003], where ten speakers originally
from the eastern zone (Egyptian, Syrian, Lebanese and Jordanian Arabic) and from the
Moroccan Arabic area (Algeria and Morocco) listened to the story 'The North Wind and
the Sun' in French and spontaneously translated it into their dialects.
Some other research projects are limited to the collection of dialectal text data, for
example the Arabic Online Commentary Dataset (AOC) mentioned in [Zaidan 2011],
created by crawling the websites of three Arabic newspapers (Al-Ghad, Al-Riyadh,
Al-Youm Al-Sabe). The commentary data consists of 52.1M words and includes sentences
from articles in the 150K crawled webpages. In fact, 41% of the content contains dialectal
words.
In addition to the AOC, an Arabic-Dialect/English Parallel Text was developed by
Raytheon BBN (Bolt, Beranek and Newman) Technologies, the LDC, and Sakhr Software.
This corpus contains approximately 3.5 million tokens from Arabic dialect sentences
with their English translations. The data consist of Arabic web text that was filtered
automatically from large Arabic text corpora provided by the LDC.
1.4 Orthographic Transcription
The acoustic signal of audio content may correspond to speech, music or noise, but also
mixtures of speech, music and noise. In addition to that, there is a variety of speakers and
topics in the same record. Indeed, transcribers can work on a given subject successively
or simultaneously. The sound quality of the recording (fidelity) may vary significantly
over time.
The different stages of the transcription work are: the segmentation of the soundtrack,
the identification of turns and speakers, the orthographic transcription, and verification.
Depending on the transcriber's choice, these steps can be conducted in a parallel or
sequential manner over long portions of the signal.
The difficulty of transcribing depends on the number of speakers involved in the record-
ings and on the clarity of their pronunciation. Processing many files in quick succession
does not make the work faster, as exhaustion slows down the process; it is preferable
to take a rest between files. In [Al-Sulaiti 2004], the average time for transcribing
a five-minute Arabic spoken record by a non-professional typist, without including any
enriched orthographic annotation, is 1:50:42 (the average of the shortest and the longest
time taken in transcription).
The annotation step aims at structuring the recording, that is, segmenting it and
describing the acoustic signal at the different levels deemed relevant for further
processing. The transcription cannot perfectly reflect the audio record or the
pronunciation of a given subject or term, and it can serve as the basis for deeper
studies of semantics, syntax, and pronunciation.
Manual transcription of audio recordings, such as radio or television streams, will ad-
vance research in automatic transcription, indexing, and archiving. Indeed, the transcrip-
tion provides a linguistic resource and data that make possible the construction of an
automatic recognition system, which can then be used to produce automatic transcrip-
tions.
1.4.1 Transcription Software
There are different types of tools for the labelling and annotation of speech corpora.
Some of them address audio formats, such as Transcriber [Barras 2000], Praat
[D.Weenink 2013], SoundIndex (1), and AMADAT (2), and others address video formats,
for example Anvil [Kipp 2011] and the EUDICO Linguistic Annotator (ELAN)
[Dreuw 2008].
• Transcriber is free software that has been used in many projects such as
[Messaoudi 2005, Piu 2007, Fromont 2012]. It has become very popular due to its
simplicity and efficiency, as it makes transcribing and labelling easier.
• Praat is a productivity tool for phoneticians. It allows speech analysis, synthesis, la-
belling and segmentation, speech manipulation, statistics, and learning algorithms,
and it produces publication-quality graphics.
• SoundIndex is a tool that allows the user to write audio tags at any level in the
hierarchy of an XML file by setting values for attributes such as the start and the
end of the audio in the sound editor. The interpretation of the audio tags is written
in XSL.
(1) Software documentation on http://michel.jacobson.free.fr/soundIndex/Sommaire.htm.
(2) AMADAT User Guidelines on http://projects.ldc.upenn.edu/EARS/Arabic/EARS_AMADAT.htm.
• The Arabic Multi-Dialectal Transcription Tool (AMADAT) allows the transcription
of speech and offers a very helpful correction-level functionality.
• Anvil is a free video annotation tool. It offers frame-accurate, hierarchical, multi-
layered annotation driven by user-defined annotation schemes. The color-coded
annotation elements are displayed on multiple time-aligned tracks. Special features
include cross-level links, non-temporal objects, and a project tool for managing
multiple annotations. Anvil allows the import of data from the widely used, public
domain phonetic tools Praat and XWaves. Anvil's data files are XML-based.
• ELAN is an annotation tool that allows creating, editing, visualizing and search-
ing annotations for video and audio data. This software aims to provide a sound
technological basis for the annotation and exploitation of multimedia recordings.
In addition, although ELAN is specifically designed for the analysis of language,
sign language, and gesture, it can be used on any media corpora, with video and/or
audio data, for purposes of annotation, analysis and documentation.
1.4.2 Transcription Guidelines
The transcription process follows specific conventions to provide records structured by
thematic content, speakers, and other speech information. These tools produce informa-
tion called annotations. Nowadays, many conventions have been adopted in NLP projects
to satisfy the need for a homogeneous transcription manner and to provide annotation
enrichment. Generally, these conventions depend on the speech data format and on the
transcription tool used.
When a speech corpus is transcribed into written text, the transcriber is immediately
confronted with the following question: how should the reality of oral speech be reflected
in a corpus?
A set of rules for writing speech corpora is designed to provide an enriched ortho-
graphic transcription. These conventions establish the annotated phenomena. Numerous
studies have been carried out on prepared speech, for example on broadcast news
[Cam 2008].
However, conversational speech refers to a more informal activity, in which partici-
pants constantly manage the topic and in which speakers and speech turns, which
correspond to changes of speaker, must be identified [Gro 2007, Cam 2008, André 2008].
As a consequence, numerous phenomena appear, such as hesitations, repeats, feedback,
backchannels, etc. Other phonetic phenomena, such as non-standard elision, reduction
phenomena [Meunier 2011], truncated words and, more generally, non-standard pronun-
ciations, are also very frequent. All these phenomena can impact the phonetization.
Hence, identifying different types of pauses (long pauses between turn-taking and short
pauses between words) is very useful for further purposes such as the development of a
voice recognition system [Alotaibi 2010].
In [Gro 2007], the conventions focus on segment structures such as elongation, trun-
cation, aspiration, and sighs. Spontaneous oral production is a real problem in terms of
annotation because, according to [Shriberg 1994]:
Disfluencies show regularities in a variety of dimensions. These regularities
can help guide and constrain models of spoken language production. In addi-
tion they can be modeled in applications to improve the automatic processing
of spontaneous speech.
Another definition of disfluencies is given in [Piu 2007]:
Disfluencies (repeats, word-fragments, self-repairs, aborted constructs, etc)
inherent in any spontaneous speech production constitute a real difficulty in
terms of annotation. Indeed, the annotation of these phenomena seems not
easily automatizable, because their study needs an interpretative judgement.
In fact, there are different types of disfluencies, as described in [Piu 2007]:
• Repetition disfluency is among the most frequent types of disfluency in conversa-
tional speech (accounting for over 20% of disfluencies). According to [Cole 2005]:
Repetition disfluencies occur when the speaker makes a premature com-
mitment to the production of a constituent, perhaps as a strategy for
holding the floor, and then hesitates while the appropriate phonetic plan
is formed.
• Self-correction, as described in [Kurdi 2003], is the substitution of one word or series
of words for another, in order to modify or correct a part of the statement.
• [Pallaud 2002] highlighted primers, i.e. word fragments whose enunciation is inter-
rupted. Generally, disfluencies can combine at least two of the phenomena mentioned
above.
In [Dipper 2009]:
Transcription guidelines specify how to transcribe letters that do not have a
modern equivalent. They also specify which letter forms represent variants
of one and the same character, and which letters are to be transcribed as
different characters.
In [Cam 2008] there are numerous transcription rules related to the speech text, such
as how to write letters, punctuation, numbers, Internet addresses, acronyms, spelling,
abbreviations, hesitations, repetitions, truncations, and absent or unknown words. A
list of markups is also used to identify noise, pronunciation problems, backchannels, and
comments.
In [André 2008] the specific pronunciations were recorded with the SAMPA phonetic
alphabet. General rules for transcribing short vowels, spelling, and diacritics are
presented in the LDC Guidelines for Transcribing Levantine Arabic (3).
1.5 Conclusion
Despite all the attempts made by the LDC and other research projects to provide
speech corpora for Arabic dialects, some varieties, like the Tunisian Dialect (TD), still
need more work on corpus construction. These attempts have faced several challenges,
some related to the Arabic language and others to general NLP issues. Moreover, some
problems become noticeable during speech transcription, such as the ambiguity of a
word's transcription. Another problem occurs when multiple sound sources are present;
it is then necessary to focus on the most prominent source. Likewise, when two speakers
are talking in the foreground, both can be transcribed through the mechanism of
superimposed speech.
(3) The Guidelines are available on the LDC website: http://ldc.upenn.edu/Projects/EARS/Arabic/www.Guidelines_Levantine_MSA.htm
Chapter 2
Morpho-Syntactic Annotation
2.1 Introduction
A transcript can be annotated by adding linguistic information for each word. In
[Sawaf 2010], linguistic annotation is defined as follows: corpus annotation is the practice
of adding interpretative, especially linguistic, information to a text corpus, by coding
added to the electronic representation of the text itself. Grammatical tagging is then the
task of associating a label, or tag, with each word in the text to indicate its grammatical
classification.
Generally speaking, the morpho-syntactic annotation process relies on word structure,
as described in [Al-Taani 2009]. Accordingly, patterns and affixes are used to determine
the grammatical class of a word following defined rules. Moreover, this process differs
according to the data and resources used. Therefore, the approaches used in the
morpho-syntactic annotation process vary from supervised to unsupervised, as described
in [Jurafsky 2008].
In the following sections we introduce morpho-syntactic annotation methods for the
Modern Standard Arabic (MSA) language and for dialectal Arabic, including the
Tunisian Dialect (TD).
2.2 Morpho-syntactic Annotation Methods for MSA
Language
The morpho-syntactic annotation process for MSA has been performed using dif-
ferent approaches, such as statistical approaches [Al-Shamsi 2006] and learning
approaches [Bosch 2005]. Recently, some works have combined approaches to improve
the performance of the developed tagger. In the following, we describe some of these
works.
The statistical approach was used in the work of [Al-Shamsi 2006] to handle the POS
tagging of Arabic text. The developed method was based on an HMM and followed these
steps:
1. Creation of a set of tags,
2. Employment of Buckwalter's stemmer to stem Arabic text from the used corpus
which contains 9.15 MB of native Arabic articles,
3. Manual correction of tagging errors,
4. Design and construction of an HMM-based model of Arabic POS tags,
5. Training of the developed POS tagger on the annotated corpus.
The proposed method achieved an F-measure score of 97%.
Another tagging system for Arabic POS tagging was proposed by [Hadj 2009]. The POS
tagging task of the system is based on the sentence structure and combines morphological
analysis with an HMM:
- The morphological analysis aims to reduce the size of the tag lexicon by seg-
menting words into their prefixes, stems, and suffixes.
- The HMM is used to represent the sentence structure in order to take into account
the logical linguistic sequencing.
Each possible state of the HMM represents a tag. The transitions between those states
are governed by the syntax of the sentence.
The training corpus is composed of old texts extracted from books of the third
century. These data were manually tagged using the developed tagset. The system
evaluation was based on the same corpus. The obtained result reaches a recognition rate
of 96%, which is considered very promising given the size of the tagged data.
In addition to the statistical approaches, recent applications tend to explore the use of
machine learning methods to handle Arabic morphology and the POS tagging process.
Indeed, a memory-based learning approach was developed by [Bosch 2005] for the
morphological analysis and part-of-speech tagging of written Arabic. The learning
classification task in memory-based learning was performed by employing the k-
nearest neighbor classifier, which searches for the k nearest neighbors. Memory-based
learning, a supervised inductive learning algorithm, treats a set of labeled training
instances as points in a multi-dimensional feature space and stores these instances as
such in an instance base in memory. Furthermore, [Bosch 2005] employed a modified
value difference metric (MVDM) distance function to determine the similarity of pairs
of values of a feature. The metric uses the conditional probabilities of the two values,
conditioned on the classes, to determine their similarity.
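To make the metric concrete, the following minimal sketch computes an MVDM distance between two feature values from their class-conditional probabilities. It is not Bosch's implementation; the Buckwalter-style feature values and the tag inventory are invented purely for illustration.

```python
from collections import Counter, defaultdict

def mvdm(value1, value2, value_class_counts):
    """Modified Value Difference Metric between two feature values.

    value_class_counts maps a feature value to a Counter of the classes seen
    with that value in training; the distance sums the differences of the
    class-conditional probabilities P(class | value).
    """
    c1, c2 = value_class_counts[value1], value_class_counts[value2]
    n1, n2 = sum(c1.values()) or 1, sum(c2.values()) or 1
    return sum(abs(c1[c] / n1 - c2[c] / n2) for c in set(c1) | set(c2))

# Toy training data: (feature value, class) pairs, invented for the example.
training = [("Al", "DET"), ("Al", "DET"), ("Al", "NOUN"),
            ("wa", "CONJ"), ("wa", "CONJ"), ("ktAb", "NOUN")]
counts = defaultdict(Counter)
for value, label in training:
    counts[value][label] += 1

print(mvdm("Al", "wa", counts))    # 2.0   -> very different class profiles
print(mvdm("Al", "ktAb", counts))  # ~1.33 -> the profiles overlap on NOUN
```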
In order to train and test the developed approach, [Bosch 2005] exploited the Arabic
Treebank 1 (version 2.0) corpus, which consists of 166,068 tagged words. The evaluation
of the morphological analyzer was based on predicting the part-of-speech tags of the
segments, the positions of the segmentations, and all letter transformations between the
surface form and the analysis. The obtained results in terms of precision, recall, and F-
score are respectively 0.41, 0.43, and 0.42. The POS tagger attained an accuracy of
66.4% on unknown words and 91.5% on all words in held-out data.
Furthermore, combining the morpho-syntactic analysis generated from the morpholog-
ical analyzer and the part-of-speech predicted by the tagger yields a joint accuracy of
58.1%. This accuracy represents the correctly predicted tags and corresponds to the full
analysis for unknown words. The main limitation of the memory-based learning approach,
as concluded in [Bosch 2005], was its inability to recognize the stem of an unknown word
and accordingly the appropriate vowel insertions.
Another approach, combining statistical and rule-based techniques, was introduced by
[Khoja 2001] to construct an Arabic part-of-speech tagger. First, the developed
approach is based on the use of traditional Arabic grammatical theory to determine the
rules applied while stemming a word; these rules are used to find the stem or root by
removing affixes (prefixes, suffixes and infixes). Second, the approach uses lexical and
contextual probabilities. The lexical probability is the probability of a word having a
certain grammatical class, whereas the contextual probability is the probability of one
tag following another tag. These probabilities are calculated from the tagged training
corpus.
The method consists in searching for a word in the lexicon to determine its possible
tags. Words not found in the lexicon are then stemmed, using combinations of affixes
to determine the tag of the word. Finally, in order to disambiguate ambiguous words
and unknown words, [Khoja 2001] used a statistical tagger based on the Viterbi
algorithm [Jelinek 1976].
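As an illustration of how lexical and contextual probabilities combine in such a tagger, here is a minimal bigram Viterbi decoder. The tag set, the probability values and the Buckwalter-style words are invented; this is a sketch of the general technique, not Khoja's tagger.

```python
def viterbi(words, tags, lexical_p, contextual_p, start_p):
    """Bigram Viterbi decoding with lexical P(word|tag) and contextual P(tag|prev tag)."""
    # Each cell holds (best score, best tag sequence ending in this tag).
    column = {t: (start_p.get(t, 1e-9) * lexical_p[t].get(words[0], 1e-9), [t]) for t in tags}
    for w in words[1:]:
        new_column = {}
        for t in tags:
            prev = max(tags, key=lambda p: column[p][0] * contextual_p[p].get(t, 1e-9))
            score = column[prev][0] * contextual_p[prev].get(t, 1e-9) * lexical_p[t].get(w, 1e-9)
            new_column[t] = (score, column[prev][1] + [t])
        column = new_column
    return max(column.values())[1]

# Invented toy model: a particle, a verb and a noun.
tags = ["NOUN", "VERB", "PART"]
lexical_p = {"NOUN": {"ktAb": 0.8}, "VERB": {"ktb": 0.7}, "PART": {"w": 0.9}}
contextual_p = {"NOUN": {"NOUN": 0.4, "VERB": 0.3, "PART": 0.3},
                "VERB": {"NOUN": 0.6, "VERB": 0.1, "PART": 0.3},
                "PART": {"NOUN": 0.5, "VERB": 0.4, "PART": 0.1}}
start_p = {"NOUN": 0.4, "VERB": 0.4, "PART": 0.2}

print(viterbi(["w", "ktb", "ktAb"], tags, lexical_p, contextual_p, start_p))
# -> ['PART', 'VERB', 'NOUN']
```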
In order to train the tagger and to construct the lexicon, [Khoja 2001] used a manually
tagged corpus of 50,000 Modern Standard Arabic words extracted from the Saudi
Al-Jazirah newspaper. The constructed lexicon contains 9,986 words. To test the
developed tagger, four corpora (85,159 words) were collected from newspapers and from
papers in social science. In addition to MSA words, the test corpus contained some
colloquial words. The statistical tagger achieved an accuracy of around 90% when
disambiguating ambiguous words.
Furthermore, [Khoja 2001] used an Arabic dictionary (4,748 roots) to test the developed
stemmer, obtaining an accuracy of 97%. Since the unanalyzed words are generally
foreign terms, proper nouns, or incorrectly written words, [Khoja 2001] concludes that
employing a pre-processing component could solve the problem.
2.3 Morpho-syntactic Annotation Methods for Dialectal Arabic
[Maamouri 2006b] describes a supervised approach used to annotate dialectal Arabic
data. A word list from the Levantine Arabic Treebank (LATB) data was used to
manually annotate the most frequent surface forms. Pattern-matching operations were
then performed to identify potential new prefix-stem-suffix combinations among the
remaining unannotated words in the list.
The Morphological/Part-of-Speech/Gloss (MPG) tagging included morphological
analysis, POS tagging, and glossing.
The developed system was evaluated before and after the use of a dictionary. The
evaluation shows a reduction of more than 10% in annotation errors.
Another supervised approach, described in [Duh 2005], was designed for tagging dialectal
Arabic using the ECA data. The developed system is based on a statistical trigram tagger
in the form of an HMM used as a baseline POS tagger. Statistical modeling and cross-
dialectal data-sharing techniques were used to enhance the performance of the baseline
tagger. The adopted approach requires only raw text data from several varieties of
Arabic and a morphological analyzer for MSA, so no dialect-specific tools were used.
To evaluate the developed system, [Duh 2005] compared the obtained results with
those obtained when using:
- a supervised tagger trained on hand-annotated data;
- a state-of-the-art MSA tagger applied to Egyptian Arabic.
As a result, there is a 10% improvement of the ECA tagger.
In addition to the supervised approaches, other projects tend toward the use of un-
supervised approaches, for example [Chiang 2006].
An Arabic dialect parser is described in [Chiang 2006], where three frameworks were
constructed for leveraging MSA corpora in order to parse LA. This process was based
on knowledge about the lexical, morphological, and syntactic differences between MSA
and LA.
[Chiang 2006] evaluated three methods:
• Sentence transduction : in which the LA sentence to be parsed is turned into an
MSA sentence and then parsed with an MSA parser;
• Treebank transduction : in which the MSA treebank is turned into an LA treebank;
• Grammar transduction : in which an MSA grammar is turned into an LA grammar
which is then used for parsing LA.
The MSA treebank data used comprise 17,617 sentences and 588,244 tokens. They
include four different lexicons: a small lexicon with uniform probabilities, a small
lexicon with EM-based probabilities, a big lexicon with uniform probabilities, and a big
lexicon with EM-based probabilities.
To evaluate the developed parser, [Chiang 2006] used data comprising 10% of the MSA
treebanks plus 2,051 sentences and 10,644 tokens from the Levantine treebank LATB.
The major limitation of this method, as concluded in [Chiang 2006], is the lack of a
demonstration of cost-effectiveness.
Another approach, described in [Al-Sabbagh 2012], uses a function-based annotation
scheme in which words are annotated based on their grammatical functions; the
morpho-syntactic structure and the grammatical function of a word can differ from each
other. The developed method is based on an implementation of Brill's Transformation-
Based POS tagging algorithm.
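A transformation-based tagger starts from a baseline tagging and repeatedly applies corrective rules learned from an annotated corpus. The sketch below only shows the rule-application step with one invented rule and an invented tag set; it is not Al-Sabbagh's tagger.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """Transformation: change from_tag to to_tag when the previous tag matches prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str

def apply_rules(tags, rules):
    """Apply Brill-style transformation rules to an initial tag sequence, one pass each."""
    tags = list(tags)
    for rule in rules:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags

# Baseline output for three tokens, then one corrective rule (both invented).
initial = ["PART", "NOUN", "NOUN"]
rules = [Rule(from_tag="NOUN", to_tag="VERB", prev_tag="PART")]
print(apply_rules(initial, rules))  # -> ['PART', 'VERB', 'NOUN']
```

During training, such rules are selected greedily according to how much each reduces the tagging errors on the annotated corpus.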
The developed tagger was trained on a manually annotated Twitter-based Egyptian Ara-
bic corpus composed of 22,834 tweets and containing 423,691 tokens. To evaluate the
developed POS tagger, a ten-fold cross-validation was performed. The obtained results
in terms of F-measure are 87.6% for the task of POS tagging without semantic feature
labeling and 82.3% for the task of POS tagging with tokenization and semantic features.
The main problem faced during analysis relates to three-letter and two-letter words,
which are highly ambiguous and can have multiple readings depending on the short
vowel pattern.
Example: the word جد (jadd) can be analyzed by the tagger as a:
Noun: meaning grandfather or seriousness.
Adverb: meaning seriously.
To solve this problem, [Al-Sabbagh 2012] concluded that a word sense disambiguation
module is fundamental to improving performance on highly ambiguous words.
2.4 Conclusion
In this chapter, we introduced methods of morpho-syntactic annotation for MSA and
dialectal Arabic. The choice of approach for the morpho-syntactic annotation task, such
as a statistical or a learning approach, depends on the available resources. Indeed, the
unsupervised techniques are not suitable for poorly resourced languages such as the
dialects. Therefore, the POS tagging process for colloquial Arabic still needs more
improvement in terms of corpus collection and annotation.
Part II
Proposed Method
Chapter 3
Data Collection and Transcription
3.1 Introduction
The transcription process consists of two basic steps. The first one provides the voice
data to be transcribed later. The second step consists in transcribing the voice data by
following directives that we have established. Indeed, to allow a better representation of
spontaneous speech phenomena, these directives take into consideration the specificities
of TD transcription. More details about these two steps are given in the following
sections.
3.2 Speech Collection
The aim of this step is to provide speech data, which is the first stage in corpus
creation. The choice of the content and type of the speech data is very important and
can be the key to further uses of our corpus. We therefore chose to provide both audio
and video speech to widen the use of our corpus in new research directions, especially
video annotation [Kipp 2011].
Furthermore, including different TD varieties (Sfaxian dialect, Sahel dialect, etc.)
improves the representativeness of the TD in our corpus. To provide speech data, we
used broadcast conversational speech as the main source of our corpus, as in the two
projects [Lamel 2007] and [Belgacem 2010]. These streams are generally radio and
television talk shows, debates, and interactive programs where the general public is
invited to participate in the discussion by telephone.
In general, the common conversational dialect in Tunisia is the dialect of the capital,
the one used on national TV and radio stations and by the majority of educated people.
Consequently, we have allocated the largest part of our corpus to this dialect.
Providing speech data with a variety of themes increases the size of the vocabulary in
our corpus and will be very useful for further applications, for example theme classifica-
tion [Bischo 2009]. We defined the following list of themes for our data selection:
Religious, Political, Cooking, Health, and Social. The latter can include record-
ings that refer to more than one theme. We also define the Other tag to mark other
types of themes.
Figure 3.1 shows the proportion of each theme in our corpus.
Figure 3.1: The proportion of themes in the corpus
Having a good amount of spoken recordings is fundamental in the design of the corpus.
A high sound quality is also required and will be useful for future processing, for
example in a voice recognition system. In addition, we included both single-speaker and
multi-speaker recordings in our collection to capture different aspects of conversational
speech. Table 3.1 gives a description of each transcribed file in our corpus. The
transcribed files reach a total duration of 1 hour, 25 minutes and 37 seconds.
The collected data files generally have a long duration that exceeds fifteen minutes; to
simplify the transcription task, we split these recordings in order to obtain sequences with
Table 3.1: Corpus files content
File Duration (sec) Number of Speakers Size (megabyte) Type
01 758 1 12.1 Video
02 867 3 13.9 Video
03 216 12 19.1 Audio
04 232 13 20.5 Audio
05 204 9 18.0 Audio
06 360 8 31.8 Audio
07 900 2 14.4 Video
08 364 3 5.8 Video
09 477 2 7.6 Video
10 759 2 12.2 Video
a duration between five and fifteen minutes. We then convert them to the MP3 format
so that they match the input expected by the transcription software.
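The thesis does not specify which tool was used for splitting and conversion; the following sketch shows one possible way to do it with the ffmpeg command-line tool called from Python. The file names are placeholders.

```python
import subprocess

def split_and_convert(source, out_prefix, segment_seconds=600):
    """Split a long recording into fixed-length chunks and encode them as MP3.

    Requires ffmpeg to be installed; the segment length defaults to 10 minutes,
    inside the 5-15 minute range targeted above.
    """
    subprocess.run(
        ["ffmpeg", "-i", source,
         "-f", "segment", "-segment_time", str(segment_seconds),
         "-vn",                     # drop any video track
         "-acodec", "libmp3lame",   # MP3 audio encoding
         f"{out_prefix}_%03d.mp3"],
        check=True,
    )

# Hypothetical usage on one of the collected video files.
split_and_convert("show_01.avi", "show_01_part")
```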
In order to provide more details about our speech data collection, we created the
references.xml file, which contains a description of each file in our corpus. XML is
used to represent the data structures in this file. Such a representation allows a simple
preview for users and an easier integration into future annotation systems. In addition,
we wrote a DTD file, named references.dtd, to validate this XML file.
Table 3.2 describes all the different tags used in the references.xml file; a small sketch
of such a file follows the table.
Table 3.2: Description of tags used in the references XML file
Label    Type     Name                       Unit      Description
ID       Integer  Identifier                 -         Identifier of a record
NAM      String   Name                       -         Name of a record
DUR      Integer  Duration                   sec       Duration of a record
TOP      List     Topic                      -         Topic of a record
NBMASP   Integer  Number of male speakers    -         Number of male speakers in a record
NBFESP   Integer  Number of female speakers  -         Number of female speakers in a record
FISI     Float    File Size                  megabyte  File size of a record
SOFITY   List     Source File Type           -         Source file type of a record (TV or Radio)
SONAM    String   Source Name                -         Source name of a record
SOFIEX   String   Source File Extension      -         Source file extension of a record
SOFISI   Float    Source File Size           megabyte  Source file size of a given record
SOTY     List     Source Type                -         Source type of a given record (Audio or Video)
SODA     Date     Source Date                -         Source date of a record
SOLI     String   Source Link                -         Source link of a record
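As an illustration only, the sketch below builds one record entry with Python's xml.etree using the tag names of Table 3.2. The element nesting, the root element name and the sample values are assumptions; the real references.xml and its references.dtd are defined by the corpus itself.

```python
import xml.etree.ElementTree as ET

TAGS = ("ID", "NAM", "DUR", "TOP", "NBMASP", "NBFESP", "FISI",
        "SOFITY", "SONAM", "SOFIEX", "SOFISI", "SOTY", "SODA", "SOLI")

def make_record(values):
    """Build one record element holding the fields listed in Table 3.2."""
    record = ET.Element("RECORD")
    for tag in TAGS:
        ET.SubElement(record, tag).text = str(values[tag])
    return record

root = ET.Element("REFERENCES")
root.append(make_record({
    "ID": 1, "NAM": "01", "DUR": 758, "TOP": "Social",
    "NBMASP": 1, "NBFESP": 0, "FISI": 12.1,
    "SOFITY": "TV", "SONAM": "example-channel", "SOFIEX": "avi",
    "SOFISI": 120.0, "SOTY": "Video", "SODA": "2013-01-01",
    "SOLI": "http://example.org/stream",
}))
ET.ElementTree(root).write("references.xml", encoding="utf-8", xml_declaration=True)
```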
3.3 Transcription Process
The speech annotation process includes the segmentation of the soundtrack, the iden-
tification of turns and speakers, and the orthographic transcription. We applied these
steps in a parallel manner, taking into consideration the notes in the Orthographic
Transcription of Tunisian Arabic [Zribi 2013] and in the Directive of Transcription and
Annotation of Tunisian Dialects.
The first voice file (duration 12:38) was initially transcribed with the Speech
Assessment Methods Phonetic Alphabet (SAMPA) for Arabic. We thought that using
SAMPA for Arabic would allow a better representation of phonetics. However, during
the transcription process we found it better to use the Arabic script, including diacritics,
instead of SAMPA for Arabic, so we adopted the Arabic script for the rest of the
transcription process.
Transcribing Arabic spoken recordings is a very long task, especially when using Arabic
script and diacritics. For example, a recording that lasts 3 minutes and 52 seconds took
more than 4 hours. In practice, transcribing each minute takes at least one hour: 15-
20 minutes for the identification of turns and speakers and 40-45 minutes for transcribing,
following the rules in our directives.
3.3.1 Transcription Tools
There is a variety of transcribing tools (SoundIndex, AMADAT, XWaves, etc) for voice
data. We selected Transcriber [Barras 2000] and Praat [D.Weenink 2013] to handle the
transcribing task. The Transcriber software was adopted for the following advantages:
Simple user interface:
• Supports many languages (including English, French and Arabic).
• Easy manipulation of voice intervals.
• Supports the use of keyboard shortcuts in annotation.
Very rich in terms of annotation:
• Defined annotation events (noise list, lexical list, named entities list, etc).
• Possibility to edit or add additional annotations.
Input file flexibility:
• Accepts speech files of long duration.
• Supports various file formats (au, wav, snd, mp3, ogg, etc).
Better output representation:
• Supports many types of encoding (UTF-8, ISO-8859-6, etc).
• The output file follows a description schema.
Concerning the Praat software, the choice was motivated by the needs of our ANLP
research group. In addition, Praat gives a better representation of speech overlaps and
allows speech analysis.
3.3.2 Transcribing Guidelines
Transcription guidelines are meant to be followed during the annotation process of our
speech data. We elaborated the Orthographic Transcription of Tunisian Arabic
directives [Zribi 2013], which are adapted from the Enriched Orthographic Transcription
(TOE in French) [Bigi 2012]. To deal with the TD, some rules have been modified or
removed. As the standard orthographic transcription does not take into consideration
the observed phenomena of speech (elisions, disfluencies, liaisons, noise, etc.), we enriched
our directives with them.
The Directive of Transcription and Annotation of Tunisian Dialects was written to give
additional phonemic and phonetic annotations for the speech data. Examples were
included to show the application of these rules in the Transcriber software. The directive
was adapted from the ESTER2 convention [Cam 2008] and takes into consideration the
specificities of the Arabic language and of the TD.
The following is a description of some conventions:
a) The identification of turns and speakers
- Sections:
We defined two kinds of sections in the audio document: the relevant sections,
identified by the title report, and the non-transcribed sections, identified by the
title nontrans. The non-transcribed sections last more than fifteen seconds and
contain, for example:
• Advertising, weather reports, programme jingles,
• Applause,
• Music, songs,
• The beginning or the end of another show different from the current programme,
• Silence.
The other sections are relevant, so they are the only sections that we segment and
transcribe, as illustrated by the example in Figure 3.2. The concept of sections is
absent from the Praat software, so we do not take these rules into consideration
while using it.
- Turn-taking
First, we identify each speaker involved in the audio document.
There are two types of speakers:
Figure 3.2: Sections and Turns in Transcriber
• Global (1): speakers identified by the syntax First Last name.
• Local (2): speakers identified by the syntax First Last name if possible.
Otherwise, we denote them by the syntax speaker # n, where n is a number
from 1 to n corresponding to the speaker order.
The same speaker must always appear with the same identifier, and the list of
speakers must contain only speakers involved in the audio document. In addition,
we fill in all the information relative to these speakers, such as gender and dialect.
Second, we attribute the name of the speaker to the speech turn. Speech turns
that do not contain any speaker speech are identified by the syntax no speaker.
The example in Figure 3.3 illustrates how to manage speakers in the Transcriber
software.
To solve the problem of speech overlapping, we adapted the solution mentioned in
[Barras 2000], where we create a new speaker named with the syntax First
Last name speaker 1 + First Last name speaker 2.
The Praat software gives a better representation of speech overlapping. In fact, the
(1) This type is common to several audio speakers, such as the presenter, journalists, etc.
(2) These are unknown speakers who intervene by telephone, for example.
Figure 3.3: Manage speakers in Transcriber
script of each speaker is presented separately in an individual interval as shown in
Figure 3.4.
- Silence:
Silence can occur at the beginning of a speaker turn, mixed with the transcript, or
at the end of a turn. To handle it, we isolate silence and noise longer than 0.5
seconds in a no speaker turn. Silence longer than 0.2 seconds at the beginning of a
speaker turn is isolated in a no speaker segment or integrated directly into the
previous speech turn. Likewise, silence longer than 0.2 seconds at the end of a
speaker segment is isolated in a no speaker segment or integrated into the upcoming
no speaker turn. Finally, we add the hashtag symbol # when there is a silence
between 0.1 and 0.2 seconds inside a relevant turn (these thresholds are sketched in
code after this subsection).
- Segments:
A relevant segment contains an intervention of a speaker and must have a minimum
of syntactic and semantic consistency. If a segment of a speech turn exceeds fifteen
seconds, we redistribute it into relevant segments.
Figure 3.4: Manage speakers in Praat
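The pause thresholds above can be summarized by the small decision function below. It is only an illustration of the rules; in practice the segmentation decisions are made manually in Transcriber or Praat.

```python
def silence_action(duration, position):
    """Return the action our directives prescribe for a silence of given length.

    duration is in seconds; position is "start", "end" or "within" a speaker turn.
    """
    if duration > 0.5:
        return "isolate the silence/noise in a 'no speaker' turn"
    if position in ("start", "end") and duration > 0.2:
        return "isolate in a 'no speaker' segment or merge with the adjacent turn"
    if position == "within" and 0.1 <= duration <= 0.2:
        return "mark the silence with # inside the relevant turn"
    return "keep the silence inside the current segment"

print(silence_action(0.15, "within"))  # -> mark the silence with # ...
print(silence_action(0.30, "start"))   # -> isolate or merge with the adjacent turn
```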
b) Orthographic transcription
To transcribe the TD, we use the Modern Standard Arabic transcription rules that
do not affect the characteristics of the dialect. We also define a set of rules that allows
the transcription of Tunisian Arabic words based on their phonology.
- Transcription of the Hamza:
Transcribe the Hamza only if it is pronounced, using a single one of the standard
Hamza forms. If the absence of the Hamza in the dialect word causes ambiguity,
it should be transcribed.
- Transcription of ta marbuta:
The ta marbuta (ة) should be written at the end of the word whether it is pro-
nounced /a/ or /t/.
Example: تفاحة (an apple); in a construct such as 'the child's book', the ta marbuta
is still written even though it is pronounced /t/.
- Code switching:
MSA, TD and foreign languages coexist in the daily speech of Tunisian people.
The transcription of MSA words should respect the transcription conventions
of the Arabic language.
Foreign words and MSA words should be written respectively using the forms
[lan:X, text or word, SAMPA pronunciation] and [lan:MSA, text or word]. We use
the SAMPA of the specific language for writing the speaker's pronunciation.
Example:
• MSA word: [lan:MSA, word]
• French word: [lan:Fr, informatique, ?anfurmati:ku]
• English word: [lan:En, network, na:twirk]
- Atypical agreement:
We keep the standard orthographic spelling of the words but transcribe them with
the agreement as it was actually said.
Example: a phrase meaning 'the holidays are wonderful' is written with the agree-
ment produced by the speaker, not with the standard agreement.
- Personal pronouns:
Pronouns must be transcribed as found in this list:
أنا (I), إنتي (you, sg.), إنتم / إنتوما (you, pl.), نحنا / أحنا (we), هو (he), هي (she), هوما (they).
- Names of months and days:
The names of the months and days must be transcribed as in the MSA language.
- Affixes and clitics:
Table 3.3 lists the dialectal clitics.
Note: the definite article ال should always be written in full, even when only /l/ is
pronounced. This rule also applies to transcribing words which start with a sun or
moon letter.
Table 3.3: Clitics in the TD language
Pronominal enclitics: ك, ه, و, ها, هم, كم, نا, ي
Negation enclitic: ش
Interrogation enclitic: شي
Proclitics: و, ل, ب, ك, ع, ال, م
- Named entities:
Named entities are annotated with dedicated representations: person names are
marked with the tag PERSON NAME and place names with the tag Place name.
- Characters:
The phonemes /v/, /g/ and /p/ do not exist in the Arabic language. To transcribe
them, we add an apostrophe ( ' ) after the corresponding letters.
- Incorrect word:
When the speaker replaces a letter with an incorrect one, we keep the original letter
and add the corresponding correct one to it. These corrections are represented as in
the following pattern, where x stands for the other letters of the word:
x{Correct letter, Original letter}x
c) Rules of marking
We transcribe what is heard (hesitations, repetitions, onomatopoeia, etc.); the transcript should stay close to the signal.
- Noise:
We insert the tag [i] to indicate a breathing-in and [e] a breathing-out of the speaker. The tag [b] indicates a noise:
• mouth noises (cough, throat noise, laughter, kisses, whisper, etc.),
• rustling of papers,
• microphone noise.
The tag [musique] indicates music.
- Punctuation:
We punctuate the text using only these punctuation marks: ., ! and ?.
3.4 Conclusion
A mixture of streamed TV and radio station programs has been collected and adapted following the speech data collection process described in this chapter. As a result, 10 files totaling more than 1 hour and 25 minutes of speech were transcribed following our transcription guidelines. The transcription process was the most laborious stage of the project. During this process we faced some problems with the transcription tools, such as the slowness of the Praat interface when transcribing long recordings and the incorrect updating of Arabic script in the Transcriber interface.
Chapter 4
Morpho-syntactic Annotation Method
4.1 Introduction
Speech and text resources for the Tunisian dialect are very rare, which is an obstacle to developing applications in this field. In this context, our project contributes to providing resources by constructing a morpho-syntactically annotated speech corpus for TD.
In this chapter, we focus on the different phases of the annotation task, which aims to identify the grammatical class of each word. Our method integrates different tools and resources to annotate TD words. Indeed, we chose two morphological analyzers (MADA and the Al-Khalil TD version analyzer) to analyze MSA words, and we use the Al-Khalil TD version and a TD dictionary to analyze TD words. The first section presents a global view of our method and the different steps followed to achieve the morpho-syntactic annotation task. Then, we describe each step of this process in detail.
4.2 Our Main Method
In our method, the morpho-syntactic annotation process follows the steps described in Figure 4.1. The process starts by extracting the speakers' text and some useful word annotations (word language, named entity, etc.) from the transcription file. These pieces of information are then saved in another file with a specific structure to be used in the next, analyzing, step.
Then, we use two morphological analyzers (MADA and the Al-Khalil TD version) and a dictionary, depending on the characteristics of each word (word language, onomatopoeia, etc.) and applying the rules we established, to determine the suitable grammatical class for each word. Finally, after the user confirms the analysis, we save the result of the morpho-syntactic annotation in an XML file respecting a specific structure. As a result, each word is assigned a tag that indicates its grammatical class. More details about these steps are given in the next sections.
Figure 4.1: Morpho-syntactic annotation steps in our method (Transcription File → Preliminary Step → Segmented File → Word Analysis → Choosing Result → Result File Generation → Annotated File; the Word Analysis step relies on the Tunisian Dialect Dictionary, the MADA analyzer and the Al-Khalil TD version analyzer)
4.3 Preliminary Step
The purpose of the preliminary step is to import the speaker information and their speech text with all the useful annotations (named entities, word language, etc.). Indeed, these annotations vary depending on the transcription tool used (Transcriber or Praat). Table 4.1 lists the information extracted from the annotations in the transcription files.
The information extraction process is summarized in the following steps (a minimal sketch of step 3 follows the list).
1. Collect speech text: the speech text of each speaker is divided into many speech turns, so we gather them into a single text per speaker.
2. Clean up the speech text: the speech text includes many annotations; some of them are very useful for the morpho-syntactic annotation process, while others, such as noise and music, are removed because they are not useful for this task.
3. Split the speech text into sentences: using the punctuation annotations (`!', `.', `?'), we divide the speech text into a list of sentences.
4. Extract word annotations: the word annotations (pronunciation notation, disfluencies, named entities, word language) described in Table 4.1 are extracted from the transcription file and later used in the morpho-syntactic analysis process.
5. Generate the segmented text file: after extracting the useful annotations from the transcription file, we generate a structured file to be used in the next step.
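A minimal Java sketch of the sentence-splitting step (step 3), assuming the speech text has already been gathered per speaker and cleaned; the class and method names are illustrative, not part of the actual TDAT code:

import java.util.ArrayList;
import java.util.List;

public class SentenceSplitter {

    // Splits a speaker's gathered speech text into sentences on the marks ., ! and ?.
    public static List<String> split(String speakerText) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : speakerText.toCharArray()) {
            current.append(c);
            if (c == '.' || c == '!' || c == '?') {
                String sentence = current.toString().trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                current.setLength(0);
            }
        }
        String rest = current.toString().trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);   // keep trailing text that has no final punctuation
        }
        return sentences;
    }
}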
Table 4.2 describes each tag used in the segmented text file.
Table 4.2: Description of tags used in the segmented text file
- Text (Text): the text of a transcribed file.
- sp (Speaker speech): the text of each speaker in a transcribed file.
- s (Sentence): a sentence from a speaker's speech.
- w (Word): a word from a sentence.
- ponct (Punctuation): the punctuation of a given sentence; one of the three values ., ! or ?.
  Example: <s id="sID1" ponct=".">
- id (Identifier): in the Text element, the id attribute identifies the file and takes the file name as its value. For the other elements, id identifies each element following a specific codification: element tag + "ID" + number.
  Examples: <w id="wID10" ...>, <s id="sID9" ...>
- na (Named entity): the attribute na identifies words that are part of a person or place name.
  Example: <w id="wID1613" na="B_N"><Value>...</Value></w> followed by <w id="wID1614" na="I_N"><Value>...</Value></w>
- elis (Elision): the attribute elis identifies a word containing an elision; it holds the word with the elided part between parentheses, while the Value element contains the correct word.
  Example: <w id="wID30" elis="..."><Value>...</Value></w>
- wLang (Word language): the default language of words in the transcribed text is TD; all other languages are considered foreign. Three values are possible for a foreign language: Fr (French), En (English) and MSA (Modern Standard Arabic).
- wTrans (Word transliteration): the transliteration of a foreign word, written in SAMPA pronunciation.
  Example: <w id="wID1612" wLang="fr" wTrans="..."><Value>professeur</Value></w>
- hesi (Hesitation): the attribute hesi identifies a hesitation word.
  Example: <w id="wID1980" hesi="..."/>
- onom (Onomatopoeia): the onom attribute identifies an onomatopoeia word.
  Example: <w id="wID41" onom="..."/>
Figure 4.2 presents an example of a segmented file.
Figure 4.2: Example of the Segmented Text
<?xml version='1.0' encoding='UTF-8'?>
<Text id="012">
  <sp id="1">
    <s id="1" ponct=".">
      <w id="1" wLang="Fr" wTrans="madame">
        <Value>madame</Value>
      </w>
      <w id="2" na="S_N">
        <Value>...</Value>
      </w>
      <w id="3">
        <Value>...</Value>
      </w>
      <w id="4">
        <Value>...</Value>
      </w>
      <w id="5" hesi="..."/>
      <w id="6">
        <Value>...</Value>
      </w>
      <!-- the remainder of the words -->
    </s>
    <!-- the remainder of the sentences -->
  </sp>
  <!-- the remainder of the speakers -->
</Text>
4.4 Word analysis
To begin with, the characteristics of a word are used to assign it the corresponding tag. Indeed, words characterized as onomatopoeia, named entities or foreign words (French or English) are assigned respectively the tags Onomatopoeia, Named entity and Not-Recognized.
Then, we use two analyzers (MADA and the Al-Khalil TD version) and a dictionary, according to the rules defined below, to identify the grammatical class of each word. Two processing paths arise according to the language of each word: Tunisian dialect words and Modern Standard Arabic words. We detail these two paths in the following.
a) Tunisian Dialect Words
Figure 4.3: Analyzing process of a TD word
Analyzing a TD word follows the different steps described in Figure 4.3. To analyze a TD word, we start by looking it up in the TD dictionary, as described in Figure 4.4. If this process does not give any analysis, we analyze the word with the Al-Khalil TD version analyzer. If the word is recognized by neither the TD dictionary nor the Al-Khalil analyzer, we remove its diacritics and reanalyze it with the TD dictionary.
Analyzing a word without its diacritics allows us to solve the problem of the same word being written with different diacritics according to the dialect of the speaker.
If these processes do not lead to any analysis, we reanalyze the word without diacritics using the Al-Khalil TD version analyzer. Finally, if there is still no possible analysis, we analyze the word with the MADA analyzer; indeed, many words written without diacritics have the same form in MSA and in TD, so analyzing them with an MSA analyzer may yield a possible analysis.
During this process, if there is more than one analysis, we proceed to rank them. When our method finds no possible analysis for a given word, we assign it the tag unknown. Furthermore, our system allows the user to intervene by choosing the correct analysis, by updating an analysis, or by adding a new one. Figure 4.4 gives more details about the dictionary procedure; a minimal sketch of the whole fallback cascade is given below.
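As an illustration, here is a minimal Java sketch of this fallback cascade; each resource is wrapped as a function from a word to its list of analyses, and the wrapper names are assumptions made for the example, not the actual TDAT classes:

import java.util.List;
import java.util.function.Function;

public class TdWordAnalyzer {

    // Strips the Arabic diacritics (fathatan .. sukun, U+064B to U+0652).
    static String removeDiacritics(String word) {
        return word.replaceAll("[\\u064B-\\u0652]", "");
    }

    // Tries each resource in order and returns the first non-empty list of analyses;
    // an empty list at the end means the word gets the tag "unknown".
    static List<String> analyze(String word,
                                Function<String, List<String>> tdDictionary,
                                Function<String, List<String>> alKhalilTd,
                                Function<String, List<String>> mada) {
        List<String> result = tdDictionary.apply(word);          // 1. TD dictionary
        if (result.isEmpty()) {
            result = alKhalilTd.apply(word);                     // 2. Al-Khalil TD version
        }
        if (result.isEmpty()) {
            String bare = removeDiacritics(word);
            result = tdDictionary.apply(bare);                   // 3. dictionary, no diacritics
            if (result.isEmpty()) {
                result = alKhalilTd.apply(bare);                 // 4. Al-Khalil TD, no diacritics
            }
            if (result.isEmpty()) {
                result = mada.apply(bare);                       // 5. MADA as a last resort
            }
        }
        return result;
    }
}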
Analyzing with the TD dictionary:
Analyzing a word with the TD dictionary is handled by applying a morpheme segmentation method, as used in [Yang 2007]. Figure 4.4 shows the different steps taken to analyze a word with the TD dictionary; these steps are executed sequentially until we obtain an analysis result, and a minimal sketch of the lookup order is given after this paragraph.
First, we search the TD dictionary in this order: Conjunctions, Pronouns, Number words, Interjections, Particles, Adjectives, Adverbs, Nouns, Verbs. Second, we look sequentially in the Adverbs, Nouns and Verbs dictionaries while trying all possible TD prefixes. Third, we do the same with the suffixes. Finally, we repeat the procedure with prefixes and suffixes combined.
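A minimal Java sketch of this lookup order, assuming the dictionaries have been loaded as a map from category name to a set of entries; the affix handling and the return convention are simplifications made for the example:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DictionaryLookup {

    static final List<String> CATEGORY_ORDER = Arrays.asList(
            "Conjunction", "Pronoun", "NumberWord", "Interjection",
            "Particle", "Adjective", "Adverb", "Noun", "Verb");

    static final List<String> OPEN_CATEGORIES = Arrays.asList("Adverb", "Noun", "Verb");

    // Returns the category of the word (with affix information) or null if nothing matches.
    static String lookup(String word, Map<String, Set<String>> dictionaries,
                         List<String> prefixes, List<String> suffixes) {
        // 1. Plain lookup, following the fixed category order.
        for (String cat : CATEGORY_ORDER) {
            if (dictionaries.getOrDefault(cat, Set.of()).contains(word)) {
                return cat;
            }
        }
        // 2. Strip each possible prefix and retry on the open categories.
        for (String p : prefixes) {
            if (word.startsWith(p)) {
                String found = inOpenCategories(dictionaries, word.substring(p.length()));
                if (found != null) return found + " (prefix " + p + ")";
            }
        }
        // 3. Strip each possible suffix and retry.
        for (String s : suffixes) {
            if (word.endsWith(s)) {
                String found = inOpenCategories(dictionaries,
                        word.substring(0, word.length() - s.length()));
                if (found != null) return found + " (suffix " + s + ")";
            }
        }
        // 4. Strip a prefix and a suffix together and retry.
        for (String p : prefixes) {
            for (String s : suffixes) {
                if (word.startsWith(p) && word.endsWith(s)
                        && word.length() > p.length() + s.length()) {
                    String found = inOpenCategories(dictionaries,
                            word.substring(p.length(), word.length() - s.length()));
                    if (found != null) return found + " (prefix " + p + ", suffix " + s + ")";
                }
            }
        }
        return null;
    }

    private static String inOpenCategories(Map<String, Set<String>> dictionaries, String stem) {
        for (String cat : OPEN_CATEGORIES) {
            if (dictionaries.getOrDefault(cat, Set.of()).contains(stem)) {
                return cat;
            }
        }
        return null;
    }
}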
Analysis ranking:
We rank and confirm each word analysis according to the following order: TD dictionary, Al-Khalil TD version analyzer, MADA analyzer. Furthermore, if these tools and resources give the same analysis, we keep only one of them.
Figure 4.4: Analyzing a word with TD dictionary
b) Modern Standard Arabic Words
We use two analyzers, MADA and Al-Khalil, to analyze MSA words. First we use MADA. Then, if there is no possible analysis, we analyze the word with the Al-Khalil analyzer. Moreover, if there is still no possible analysis, we remove the diacritics and reanalyze it with Al-Khalil.
4.5 Choosing results
When using the adapted Al-Khalil analyzer and the dictionary, a problem of having several analyses for the same word can appear. This problem is caused by differences in diacritic writing or by ambiguity.
Problem with the Al-Khalil analyses:
Usually, the adapted Al-Khalil analyzer returns a list of analyses for a given word with different pieces of information (prefix, suffix, gender, number, person, voice, etc.). In general, this problem is related to the ambiguity of the Arabic language.
Problem while using the dictionary:
The problem with the dictionary analyses is related to the diacritic writing of words. To solve it, we rank these results by comparing their distance to the original word. Indeed, we use the Levenshtein distance [Haldar 2011] to measure the difference between two sequences or words.
Mathematically, the Levenshtein distance between two strings a and b is given by lev_{a,b}(|a|, |b|), where:

lev_{a,b}(i, j) =
  0,  if i = j = 0
  i,  if j = 0 and i > 0
  j,  if i = 0 and j > 0
  min( lev_{a,b}(i-1, j) + 1,  lev_{a,b}(i, j-1) + 1,  lev_{a,b}(i-1, j-1) + 1_{(a_i ≠ b_j)} ),  otherwise

• lev_{a,b}(i-1, j) + 1: the minimum corresponds to a deletion (from a to b).
• lev_{a,b}(i, j-1) + 1: the minimum corresponds to an insertion (from a to b).
• lev_{a,b}(i-1, j-1) + 1_{(a_i ≠ b_j)}: the minimum corresponds to a match or a mismatch (from a to b), depending on whether the respective symbols are the same.
Example: for an original word from the corpus and the word returned by the analysis, the Levenshtein distance between the two words is 2, as calculated in Table 4.3. A minimal sketch of the computation follows.
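For reference, a standard iterative dynamic-programming implementation of the Levenshtein distance in Java (an illustration of the measure, not the code of the tool itself):

public class Levenshtein {

    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // i deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // j insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,       // deletion
                                 d[i][j - 1] + 1),      // insertion
                        d[i - 1][j - 1] + cost);        // match or mismatch
            }
        }
        return d[a.length()][b.length()];
    }
}

Candidate analyses can then be sorted by increasing distance to the original word, so that the closest spelling is ranked first.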
When we have analyses from both the Al-Khalil TD version analyzer and the TD dictionary (the case where we have a problem with diacritic writing), we use the dictionary analysis to confirm the Al-Khalil analysis if they have the same grammatical function.
Finally, we generate the annotation file as described in the following section.
4.6 Result file generation
The tags used while generating the annotation file are presented in Table 4.4; an illustrative sketch of a resulting word entry is given after the table.
Table 4.4: Description of tags used in the annotation result file
- asp (aspect): the aspect (order or request, perfective, imperfective)
- vox (voice): the voice (active, passive, etc.)
- stt (state): the state (indefinite, definite, construct, etc.)
- per (person): the person (1st, 2nd, 3rd)
- num (number): the number (singular, dual, plural)
- gen (gender): the gender (feminine, masculine)
- case (case): the case (nominative, accusative, genitive)
- suffix (suffix): the suffix of the word
- pattern (pattern): the pattern of the word
- root (root): the root of the word
- stem (stem): the stem of the word
- spee (part of speech): the part of speech of the word
- prefix (prefix): the prefix of the word
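A purely illustrative sketch of how one annotated word entry could be written with the JDOM library mentioned in Chapter 5, carrying a subset of the Table 4.4 tags as attributes; the element and attribute layout is an assumption made for the example, and the actual structure of the TDAT result file may differ:

import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;

public class ResultFileSketch {
    public static void main(String[] args) throws Exception {
        // One annotated word with a subset of the Table 4.4 information.
        Element word = new Element("w").setAttribute("id", "wID10");
        word.setAttribute("spee", "Verb")        // part of speech
            .setAttribute("asp", "perfective")   // aspect
            .setAttribute("per", "3rd")          // person
            .setAttribute("num", "singular")     // number
            .setAttribute("gen", "masculine")    // gender
            .setAttribute("prefix", "")
            .setAttribute("suffix", "");
        Element text = new Element("Text").setAttribute("id", "012").addContent(word);
        new XMLOutputter(Format.getPrettyFormat()).output(new Document(text), System.out);
    }
}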
4.7 Conclusion
Many problems complicated the integration of the tools we used. They were mainly due to the different input/output formats of the tools and to the granularity of their tag sets. In addition, other problems appeared while using the analysis tools. For example, the MADA analyzer ignores word characteristics such as the language of the word, which can affect the analysis result. Another problem occurs when the Al-Khalil analyzer returns many analyses, so the user has to intervene to choose one of them.
Table 4.1: Annotations extracted from the transcription files
- Pronunciation notation: elongation, liaisons, elisions, incomprehensible sequence
- Disfluencies: incomplete word, onomatopoeia
- Named entities: person name, place name
- Word language: TD, MSA, French, English
Table 4.3: Levenshtein distance table example (the dynamic-programming matrix computed between the original word and the word returned by the analysis; the bottom-right cell gives the distance, here 2)
Chapter 5
Realization and Performance Evaluation
5.1 Introduction
The expansion of NLP applications for dialectal Arabic requires a large amount of resources in terms of data and tools. By developing a morpho-syntactic annotation tool for the TD language, we facilitate the morpho-syntactic annotation task, which in turn supports the construction of corpora.
In this chapter, we introduce the tools and resources used in our system. Then, we present our TD annotation tool by explaining its different modules and functionality, and by providing some details about the development environment. Finally, we experiment with our tool and discuss the results obtained in the different assessments.
5.2 Tools and Resources
The morpho-syntactic annotation process contains several tasks that can be handled using existing tools and resources. In this section, we introduce the analyzers and the dictionary we used.
5.2.1 Al-Khalil analyzer
The Al-Khalil analyzer was developed to produce tags for a given text by performing a morphological analysis of the text. Its lexical resource consists of several classes that handle both vowelled and unvocalized words. The main process is based on using patterns for both verbal and nominal words, Arabic word roots, and affixes.
Indeed, according to [Altabba 2010], Al-Khalil is still the best morphological analyzer for Arabic. In addition, Al-Khalil won the first prize at a competition organized by the Arab League Educational, Cultural and Scientific Organization (ALECSO) in 2010.
5.2.2 MADA analyzer
The MADA+TOKAN toolkit is a morphological analyzer introduced by [Habash 2009] and used to derive extensive morphological and contextual information from Arabic text. The toolkit covers many tasks: high-accuracy part-of-speech tagging, diacritization, lemmatization, disambiguation, stemming, and glossing. In addition, MADA ranks the analysis results and outputs the most suitable analysis for the current context of each word.
The analysis results carry complete diacritic, lexemic, gloss and morphological information. TOKAN then takes the information provided by MADA to generate tokenized output in a wide variety of customizable formats, which allows easier extraction and manipulation. MADA achieves an accuracy of 86% in predicting full diacritization and 96% on basic morphological choice and on lemmatization.
5.2.3 Al-Khalil TD version
Recent research in our ANLP group [Amouri 2013] studied the dialect by adapting the Al-Khalil analyzer to the TD language. Thanks to the enrichment of the transformation rules, the adapted analyzer achieves scores of 81.17% recall and 96.64% accuracy for correctly analyzed verbs.
5.2.4 Tunisian Dialect Dictionary
The TD dictionary [Ayed 2013, Boujelbane 2013] was constructed using the lexical units of the Arabic Treebank corpus and their parts of speech, by converting words from MSA to the TD language. The result is an XML lexical database composed of nine dictionaries (Conjunctions, Pronouns, Number words, Interjections, Particles, Adjectives, Adverbs, Nouns, Verbs).
5.3 Tunisian Dialect Annotation Tool
This section presents our TD annotation tool (TDAT). The first part clarifies the usefulness of our system and its functionality. The second part gives details on its characteristics and the development environment.
5.3.1 Process of TDAT
In order to specify and visualize the artifacts of our system, we detail its functionalities and its manipulation procedure. We also introduce the structure of our system.
a- System functionalities
The principal functionality of our system is to generate a morpho-syntactic annotation for each word of the transcription file. This process is composed of two basic steps. The first is to segment the transcription file; the generated segmented file follows a unique XML structure which allows a better representation of the text and of the speech phenomena. The second step takes a segmented text file as input and then analyses each word by determining its suitable grammatical class.
Our annotation system allows easy manipulation of the obtained analysis results. The user can show, update and save an annotation file, and open an unfinished annotated file to complete it. Additionally, the user can select, among the available dictionary and morphological analyzers, which ones will be used in the morpho-syntactic annotation process.
• Segment a Transcription File:
The aim of the segmenting script is to prepare the transcription file as input to our TDAT. Indeed, this tool allows the TDAT to support different transcription file types; currently, it supports two transcription file formats (trs and TextGrid).
The generated file (Segmented Text) follows a unique XML structure that gives the user a better representation of the speech phenomena. The structure was created in accordance with the TEI recommendations. Furthermore, the segmented text allows our system to interpret the orthographic transcription more easily. The segmenting tool also generates three other files:
- Words' list: contains the list of all words and their frequencies.
- Sentences' list: contains the list of all sentences.
- Statistics: contains some useful statistics about the content of the transcription file (see Figure 5.1 for more details).
<?xml version='1.0' encoding='UTF-8'?>
<STATISTIC id="110">
  <Speakers>3</Speakers>
  <Sentence>267</Sentence>
  <Words>3231</Words>
  <Hezitation>117</Hezitation>
  <Onomatopoeia>150</Onomatopoeia>
  <Elisions>102</Elisions>
  <NamedEntity>54</NamedEntity>
</STATISTIC>
Figure 5.1: XML Schema of the Annotated Text
In order to segment a transcription file (trs or TextGrid), the user has to open or select a transcription file in the corpus file tree (see (2) in Figure 5.3 for more details). Then, our tool starts the process by opening the transcription file; an error message appears if a problem occurs during this process, such as an unexpected file format.
Throughout this process, the system informs the user about the progress (see (4) in Figure 5.3). Then, our tool loads the generated segmented text file and asks whether to show the obtained result. Finally, our system loads and shows in the statistics menu the statistical information (number of words, sentences, speakers, hesitations, onomatopoeia, elisions and named entities) relative to the segmented file (see (2) in Figure 5.4 for more details). The user is notified if there is a problem while loading the segmented file or the statistics file.
• List of Word Frequencies:
To show the list of word frequencies, an already segmented file must be opened or selected in the segmented file list. Our system then loads the word frequency list relative to the selected segmented file. If there is a problem while loading it, an error message informs the user.
• Annotation options:
To analyse a word, our system uses analyzers and a dictionary. These choices can be updated before annotating a file by selecting, among the available analyzers and dictionary, which ones to use. In addition, the path of the used resources can easily be updated (see Figure 5.8). These options take effect, after saving, when a new annotation process is launched.
• Segmented File Annotation:
The annotation process starts when the user wants to annotate a transcribed file. The user can either segment an opened transcription file from the segmented file list or open an incomplete annotated file. Then, given the selected resource options (dictionary and analyzers), the developed system launches the analysis process.
During this process, the analysis results appear progressively in the annotation window (see (1) in Figure 5.6 for more details) and another window appears to inform the user about the progress (see (2) in Figure 5.6). In addition, our system gives the user the possibility to show a recap of the analyzed words during the annotation process (see Figure 5.7 for more details). If a problem occurs while analyzing a word or executing one of the morphological analyzer tools, an error message appears in the console. The user can also stop the whole current process at any time (see (2) in Figure 5.6).
When there is more than one analysis for a given word, i.e. a case of ambiguity, the user has to intervene to select and confirm the right analysis for the word (see Figure 5.5). After confirming all the analyses, the user can save the annotation file; otherwise, the system saves the incomplete file, conserving all the analyses of each word that has not been confirmed. To save an annotation file, the user has to select the file format and the result directory path. Finally, a message informs the user that the annotation file has been successfully saved and a new window containing the obtained result appears (see Figure 5.9). Otherwise, a message informs the user that a problem occurred while saving the annotation file.
• Update Analysis Results:
The user has the possibility to update the analysis results during or after the analysis. To update an analysis result, the user selects the appropriate grammatical class of a given word from the analysis list returned by our system. In addition, the user can add a new analysis by typing the additional information such as the prefix and suffix (see Figure 5.5).
After confirming the new analysis, the system updates the annotation window by adding the new tag at the top of the analysis list of the relevant word. The newly added analysis is considered the best analysis, so there is no need to confirm it later.
b- System Collaborations:
Our system collaborates with other tools to generate the most suitable grammatical class for each word. Indeed, our system interacts with:
• the MADA analyzer, to analyze a text;
• the Al-Khalil analyzer, to analyze a word;
• the Al-Khalil TD analyzer, to analyze a word;
• a Perl script, to segment a transcription file.
5.3.2 Realization
We chose the Java programming language to develop our system for several reasons:
• Java is one of the most popular programming languages in use¹ (as of 2012), thanks to its simplicity.
• It is platform-independent at both the source and binary levels.
• It allows creating modular programs, which makes it easy to reuse predefined structures of other projects, in particular the Al-Khalil source code.
Furthermore, we used multi-threading to perform several tasks simultaneously, especially during the annotation process. By using this technique, we first increased the processing speed. Second, we allowed a direct display of the results, which lets the user intervene to confirm the returned analyses instead of waiting for the end of the whole annotation process; a minimal sketch of this pattern is given below.
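A minimal sketch of this pattern, assuming a Swing-based interface; the class is illustrative, not the actual TDAT code. The annotation runs in a background thread and publishes every analyzed word as soon as it is ready, so the interface can display it and let the user confirm it immediately:

import java.util.List;
import javax.swing.SwingWorker;

public class AnnotationTask extends SwingWorker<Void, String> {

    private final List<String> words;

    public AnnotationTask(List<String> words) {
        this.words = words;
    }

    @Override
    protected Void doInBackground() {
        for (String word : words) {
            if (isCancelled()) break;           // the user can stop the process at any time
            String analysis = analyze(word);    // placeholder for the real analysis cascade
            publish(word + " -> " + analysis);  // hand the result over to the Swing thread
        }
        return null;
    }

    @Override
    protected void process(List<String> chunks) {
        // Runs on the Swing event-dispatch thread: update the annotation window here.
        chunks.forEach(System.out::println);
    }

    private String analyze(String word) {
        return "...";
    }
}

The task is started with new AnnotationTask(words).execute() and can be interrupted with cancel(true).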
We selected the Eclipse development environment because it allows us to program in several languages at once, in particular Perl and Java. Besides its extensibility in terms of programming languages, this multi-platform environment was already used in the development of the Al-Khalil analyzer. We therefore kept the same project characteristics, such as the text file encoding (Cp1256).
¹ http://en.wikipedia.org/wiki/Java_(programming_language)
In addition, we chose to work in a Linux environment to take advantage of its speed. This also allows better performance of the MADA analyzer, which basically works in this environment; indeed, MADA is built entirely in the Perl programming language.
To manipulate the segmented file and the annotation file, we use the XPath expression language. XPath is based on a tree representation of the XML document and provides the ability to navigate the tree by selecting nodes with a variety of criteria. XPath was defined by the World Wide Web Consortium (W3C), and its use in our environment requires the JDOM package; we therefore imported the JDOM library (version 2.0.0) into our project. A minimal sketch of this usage follows.
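A minimal sketch of this usage, assuming a segmented file named segmented.xml structured as in Figure 4.2 (illustrative code, not the exact TDAT implementation):

import java.io.File;
import java.util.List;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.filter.Filters;
import org.jdom2.input.SAXBuilder;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;

public class SegmentedFileReader {
    public static void main(String[] args) throws Exception {
        // Load the segmented file as a JDOM tree.
        Document doc = new SAXBuilder().build(new File("segmented.xml"));
        // Select every word element with an XPath expression.
        XPathExpression<Element> allWords =
                XPathFactory.instance().compile("//w", Filters.element());
        List<Element> words = allWords.evaluate(doc);
        for (Element w : words) {
            System.out.println(w.getAttributeValue("id") + " : " + w.getChildText("Value"));
        }
    }
}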
Thanks to the use of the Perl programming language, processing even a large file takes less than one second. Perl provides many predefined functions that allow easy manipulation and generation of text files, and it is considered a leader in the field of text file processing. Updating the Perl code is also quite easy and avoids updating the whole application.
Furthermore, Perl is a multi-platform language and is usually pre-installed on Linux systems. Since we already use the MADA analyzer, relying on this language fits our TDAT without any extra requirement. We also added the EPIC plug-in to the Eclipse development environment in order to edit the Perl scripts.
5.3.3 TDAT Interfaces
Main interface
The main interface is divided into four parts.
Main menu:
The purpose of the main menu ((1) in Figure 5.2) is to provide easy access to all the files used by the application. The menu is organized according to the format of the files:
• File: file management.
• Transcription: management of the transcription files.
• Segmented Text: management of the segmented text files.
• Annotation: management of the annotation files.
Figure 5.2: Main Interface in the TDAT
Speed access menu:
The main purpose of the speed access menu ((2) in Figure 5.2) is to give easy access to the basic functions of the application; that is why we grouped these functions in four themes:
1. Corpus
2. Transcription
3. Annotation
4. Statistics
Transcription interface
The content of a transcription file can be visualized (see (1) in Figure 5.3) through the corpus tree (see (2) in Figure 5.3). The corpus tree contains all the transcription files of the corpus.
Figure 5.3: Transcription window in the TDAT
Segmented text interface
To start segmenting a transcription file (see (3) in Figure 5.3), the user has to choose a file from the corpus tree. The segmentation process is presented as in Figure 5.3. When the process is finished, a new window containing the segmented text appears (see (1) in Figure 5.4). In addition, the statistics menu (see (2) in Figure 5.4) shows the content of the corresponding statistics file.
Add analysis interface
The user can update the analysis results by selecting the right grammatical class for each word (see Figure 5.6). Furthermore, the user can add a new analysis for a given word by choosing the grammatical class and entering its prefix and suffix in the Add analysis interface (see Figure 5.5).
Analysis interface
The analyses in Figure 5.6 appear progressively, and the status icon changes according to the progress of the analysis. Another window shows the progress of the selected options (see (2) in Figure 5.6).
Figure 5.4: Segmented Text window in the TDAT
Figure 5.5: Add an analysis result window in the TDAT
Table 5.1 describes the meaning of each icon:
Table 5.1: Description of icons used in the annotation interface
Figure 5.6: Analyze window in the TDAT
Icon Description
• No possible analysis
• One analysis
• Many analyses
• Add an analysis
• Confirm an analysis
• Analysis given by the MADA analyzer
• Analysis given by the Al-Khalil analyzer
• Analysis given by the TD dictionary
• An analysis proposed by the user
• An analysis extracted from the transcription
Analysis details interface
Figure 5.7: Analysis Details window in the TDAT
The statistics button in the annotation interface (see (3) in Figure 5.6) gives the user details about the recognized words (see Figure 5.7). These statistics are updated automatically during the analysis process.
Annotation options interface
The annotation options window is shown in Figure 5.8.
Analysis result file interface
To save the analysis, the user has to choose the file format and to enter the result file name and path. The generated annotation file is shown in Figure 5.9.
Figure 5.8: Annotation options window in the TDAT
5.4 Evaluation
Evaluating a morpho-syntactic annotation system allows us to determine its capabilities and to diagnose its strengths and weaknesses; the evaluation process therefore requires a lot of objectivity.
The best-known evaluation method is to compare the performance of the developed system with other similar systems. Such a system must have the same input and output. The capability of analyzing a word with its diacritics can also be a decisive factor in the evaluation.
Another method used in the state of the art to evaluate a morphological analyzer is to compare the analysis results with a gold standard. However, there is no gold-standard morpho-syntactic annotation available for the TD language yet. Thus, we developed a gold standard for evaluating our tool, as described below. Then, we evaluate three basic modules of the developed system. Finally, we summarize the strengths and weaknesses of the developed TDAT.
Figure 5.9: Analysis Result File window in the TDAT
5.4.1 Gold standard for the TD language
In order to evaluate our system, we developed a gold standard for the TD language composed of two annotated transcription files (2,409 words). These morpho-syntactic annotations were created manually by an expert in linguistics. The annotation tags used are the same as in the TD dictionary (Conjunction, Pronoun, Number word, Interjection, Particle, Adjective, Adverb, Noun, Verb). We also included the suffix and prefix as additional information.
5.4.2 Evaluation of the TDAT
We chose to evaluate three modules of our system.
Evaluation of the word segmenter:
One of the basic tasks when analyzing a word with the dictionary is to determine its prefix and suffix. Indeed, a word may carry a suffix and a prefix, so the system has to decompose it using our word segmenter module. In order to evaluate this module, we first analyzed the words without segmenting them; then we used the segmenter to identify the possible prefix and suffix of each word not recognized by our system.
Example:
One word of the test corpus could not be recognized when analyzed directly with the dictionary. However, by using the word segmenter module inside the dictionary analyzer, our system identifies its suffix and annotates the word as a verb.
The evaluation results in terms of accuracy are detailed in Table 5.2. The accuracy is defined as:
Accuracy = (TP + TN) / (TP + TN + F) = T / AW
where:
• TP: words recognized and correctly analyzed
• TN: words recognized but not correctly analyzed
• F: words not recognized
• T: all recognized words
• AW: all analyzed words
Table 5.2: Evaluation results of the word segmenter module
Before using the segmenter module: 239 recognized, 826 not recognized, accuracy 22.44%
After using the segmenter module: 410 recognized, 655 not recognized, accuracy 38.49%
Indeed, by integrating the segmenter module, we achieve an improvement of 16.05 points of accuracy for the dictionary module (410 instead of 239 recognized words out of the 1,065 analyzed words).
Evaluation of the Levenshtein distance:
Analyzing a word with the TD dictionary can produce several analyses, which is primarily due to differences in diacritic writing. After testing our system on a transcription text composed of 1,065 words, we found that 47% of the analyses obtained from the dictionary module (81 cases) are ambiguous: multiple choices appear for each such word, owing to the use of the segmenter module. To rank these analyses, we use the Levenshtein distance.
Example: for a word whose dictionary lookup returns two candidate analyses, applying the Levenshtein distance reorders them so that the candidate with the shortest distance to the original word comes first, and that candidate is the one the system selects.
In order to evaluate the usefulness of the Levenshtein distance function, we evaluate the precision of the TD dictionary module. The precision is defined as:
Precision = TruePositives / TestOutcomePositives = CorrectlyAnalyzedWords / AllRecognizedWords
As our system considers the first analysis in the list as the best analysis of a word, ranking the candidate analyses with the Levenshtein distance improves the system performance. Table 5.3 gives the evaluation results of the TD dictionary module before and after introducing the Levenshtein distance procedure.
Table 5.3: Evaluation results of the TD dictionary module using the Levenshtein distance function
Before using the Levenshtein distance: 52 correctly analyzed, 29 not correctly analyzed, precision 64.19%
After using the Levenshtein distance: 77 correctly analyzed, 4 not correctly analyzed, precision 95.06%
We notice that in some cases ranking the candidates by their suffix and diacritic distance does not solve the ambiguity problem. The integration of a classification module based on the sentence context of the word could solve it.
Evaluation of the analysis results:
The main function of our tool is to deliver as output a morpho-syntactic annotation for each word of the input. In order to evaluate this module, we used our gold standard as a test corpus.
The evaluation results of the TDAT, when analyzing with all resource options enabled, are detailed in Table 5.4.
Table 5.4: Evaluation results of the TDAT
Recognized and correctly analyzed: 1803
Recognized but incorrectly analyzed: 355
Not recognized: 251
We obtain a precision of 83.54%. The incorrect analyses are mostly adjectives interpreted as verbs; this problem appears when analyzing complex TD words with the Al-Khalil analyzer, whose patterns give an incorrect interpretation once the word suffixes are removed.
When using the MADA analyzer, some words are incorrectly tagged as nouns, which is caused by the difference in meaning of these words between the TD and MSA languages.
The analysis errors when using the TD dictionary module are due to diacritic differences, a problem caused by the variation in spoken TD.
We used the F-measure to study the quality of the analysis results. The F-measure is defined as:
F-score1 = 2·TP / (2·TP + F)
where:
• TP: correctly analyzed words
• F: words not recognized
We obtained an F-score1 of 91.03%, which is a promising result compared to the existing tools for the TD language; a small sketch of the three evaluation measures used in this section is given below.
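For reference, a small Java sketch gathering the three measures exactly as they are defined in this chapter (illustrative code, independent of the TDAT implementation):

public class EvaluationMeasures {

    // Accuracy = T / AW: all recognized words over all analyzed words.
    static double accuracy(int recognized, int notRecognized) {
        return (double) recognized / (recognized + notRecognized);
    }

    // Precision: correctly analyzed words over all recognized words.
    static double precision(int correct, int incorrect) {
        return (double) correct / (correct + incorrect);
    }

    // F-score1 = 2*TP / (2*TP + F), with TP the correct analyses and F the unrecognized words.
    static double fScore1(int correct, int notRecognized) {
        return 2.0 * correct / (2.0 * correct + notRecognized);
    }

    public static void main(String[] args) {
        // Word segmenter before segmentation (Table 5.2): 239 recognized vs 826 not recognized.
        System.out.printf("Accuracy: %.2f%%%n", 100 * accuracy(239, 826));   // 22.44%
        // TD dictionary with Levenshtein ranking (Table 5.3): 77 correct vs 4 incorrect.
        System.out.printf("Precision: %.2f%%%n", 100 * precision(77, 4));    // 95.06%
    }
}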
5.5 Conclusion
In this chapter we presented the TDAT, which was developed to handle the morpho-syntactic annotation task. In order to allow easier use and extension of our project, we used free software that supports multiple platforms. The TDAT also uses different resource options, which leads to a detailed analysis. Despite requiring more analysis time than the other tools used, the MADA analyzer remains very useful for transcripts that contain formal discussion, such as TV dialogue.
The results obtained when using all resource options are very promising: we achieved an F-score1 of 91.03% on our test corpus. In addition, the developed tool could be improved by ranking the analysis results. Also, enriching the TD dictionary could lead to better results, especially for nouns.
Conclusion
The tools for dialectal Arabic are few and often miss certain features or do not reach the same standard as their MSA counterparts. There is thus a need for resources and tools for the Arabic dialects in order to start creating new and better NLP applications. By developing the TDAT, we aimed to provide a tool that accepts different types of transcription formats and produces morpho-syntactic annotations for TD words.
In order to build a morpho-syntactically annotated corpus for the TD language, we started by collecting speech data. Then, we transcribed the collected data following our orthographic transcription guidelines and using two transcription tools (Transcriber and Praat). Finally, we developed a tool that takes the elaborated transcription files as input and produces a morpho-syntactically annotated file as output. To handle this task, our tool uses a TD dictionary and two analyzers (the Al-Khalil TD analyzer and the MADA analyzer); the selected resource options can easily be updated.
During the transcription process we created a corpus of more than 1 hour and 25 minutes of speech. A portion of the developed corpus was used to train the developed system.
In order to determine the capability of our TDAT tool to analyse TD text, we constructed a test gold standard for the TD language and used a portion of this corpus to test the different modules of our tool. The evaluation results show that the dictionary analysis, using the segmenter module and the Levenshtein ranking, reaches a precision of 95.06%. However, given the obtained accuracy score, the dictionary analysis still needs improvement through the enrichment of the dictionary, especially the noun dictionary. Thanks to the use of Al-Khalil TD and the other resource options, our tool attains an F-score1 of 91.03%.
The developed corpus could be enlarged by integrating other topics. Furthermore, our corpus covers different subjects and can be used to train linguistic analysis models, in automatic speech processing, or in any other area of natural language processing.
The analysis results obtained by our tool could be improved by: enlarging the TD lexical database; using a classification module, based for example on statistics, to rank the analysis results; and updating the Al-Khalil patterns and database by studying the new ambiguous cases encountered during analysis. The input of our system could also be extended to support other speech text formats, such as web pages, since the use of dialectal language is increasing on social networks.
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis

More Related Content

What's hot

Application mobile bancaire sous la plateforme Android
Application mobile bancaire sous la plateforme AndroidApplication mobile bancaire sous la plateforme Android
Application mobile bancaire sous la plateforme AndroidKhaled Fayala
 
Rapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDI
Rapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDIRapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDI
Rapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDIMohammed Boussardi
 
Rapport PFE MeetASAP
Rapport PFE MeetASAP Rapport PFE MeetASAP
Rapport PFE MeetASAP Aroua Jouini
 
Rapport PFE Développent d'une application bancaire mobile
Rapport PFE Développent d'une application bancaire mobileRapport PFE Développent d'une application bancaire mobile
Rapport PFE Développent d'une application bancaire mobileNader Somrani
 
Rapport de projet de fin d"études
Rapport de projet de fin d"étudesRapport de projet de fin d"études
Rapport de projet de fin d"étudesMohamed Boubaya
 
Rapport de projet de fin d’étude
Rapport  de projet de fin d’étudeRapport  de projet de fin d’étude
Rapport de projet de fin d’étudeOumaimaOuedherfi
 
Rapport PFE: PIM (Product Information Management) - A graduation project repo...
Rapport PFE: PIM (Product Information Management) - A graduation project repo...Rapport PFE: PIM (Product Information Management) - A graduation project repo...
Rapport PFE: PIM (Product Information Management) - A graduation project repo...younes elmorabit
 
La business Intelligence Agile
La business Intelligence AgileLa business Intelligence Agile
La business Intelligence Agiledihiaselma
 
Projet de fin étude ( LFIG : Conception et Développement d'une application W...
Projet de fin étude  ( LFIG : Conception et Développement d'une application W...Projet de fin étude  ( LFIG : Conception et Développement d'une application W...
Projet de fin étude ( LFIG : Conception et Développement d'une application W...Ramzi Noumairi
 
Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...
Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...
Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...luanvantrust
 
Rapport pfe talan_2018_donia_hammami
Rapport pfe talan_2018_donia_hammamiRapport pfe talan_2018_donia_hammami
Rapport pfe talan_2018_donia_hammamiDonia Hammami
 
Rapport de stage d'été
Rapport de stage d'étéRapport de stage d'été
Rapport de stage d'étéJinenAbdelhak
 

What's hot (20)

Application mobile bancaire sous la plateforme Android
Application mobile bancaire sous la plateforme AndroidApplication mobile bancaire sous la plateforme Android
Application mobile bancaire sous la plateforme Android
 
Belwafi bilel
Belwafi bilelBelwafi bilel
Belwafi bilel
 
Rapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDI
Rapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDIRapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDI
Rapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDI
 
Rapport PFE MeetASAP
Rapport PFE MeetASAP Rapport PFE MeetASAP
Rapport PFE MeetASAP
 
Rapport PFE Développent d'une application bancaire mobile
Rapport PFE Développent d'une application bancaire mobileRapport PFE Développent d'une application bancaire mobile
Rapport PFE Développent d'une application bancaire mobile
 
Rapport de stage
Rapport de stageRapport de stage
Rapport de stage
 
Đề tài hiệu quả quản trị rủi ro tỉ giá, ĐIỂM 8, HOT
Đề tài  hiệu quả quản trị rủi ro tỉ giá, ĐIỂM 8, HOTĐề tài  hiệu quả quản trị rủi ro tỉ giá, ĐIỂM 8, HOT
Đề tài hiệu quả quản trị rủi ro tỉ giá, ĐIỂM 8, HOT
 
Rapport de projet de fin d"études
Rapport de projet de fin d"étudesRapport de projet de fin d"études
Rapport de projet de fin d"études
 
Rapport de projet de fin d’étude
Rapport  de projet de fin d’étudeRapport  de projet de fin d’étude
Rapport de projet de fin d’étude
 
Rapport PFE: PIM (Product Information Management) - A graduation project repo...
Rapport PFE: PIM (Product Information Management) - A graduation project repo...Rapport PFE: PIM (Product Information Management) - A graduation project repo...
Rapport PFE: PIM (Product Information Management) - A graduation project repo...
 
La business Intelligence Agile
La business Intelligence AgileLa business Intelligence Agile
La business Intelligence Agile
 
Presentation PFE
Presentation PFEPresentation PFE
Presentation PFE
 
Projet de fin étude ( LFIG : Conception et Développement d'une application W...
Projet de fin étude  ( LFIG : Conception et Développement d'une application W...Projet de fin étude  ( LFIG : Conception et Développement d'une application W...
Projet de fin étude ( LFIG : Conception et Développement d'une application W...
 
Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...
Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...
Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...
 
Đề tài: Phân tích tình hình tài chính của Ngân hàng Agribank, 9đ
Đề tài: Phân tích tình hình tài chính của Ngân hàng Agribank, 9đĐề tài: Phân tích tình hình tài chính của Ngân hàng Agribank, 9đ
Đề tài: Phân tích tình hình tài chính của Ngân hàng Agribank, 9đ
 
Rapport pfe talan_2018_donia_hammami
Rapport pfe talan_2018_donia_hammamiRapport pfe talan_2018_donia_hammami
Rapport pfe talan_2018_donia_hammami
 
Rapport de stage d'été
Rapport de stage d'étéRapport de stage d'été
Rapport de stage d'été
 
Luận văn: Hoàn thiện công tác huy động vốn dân cư tại ngân hàng, HAY!
Luận văn: Hoàn thiện công tác huy động vốn dân cư tại ngân hàng, HAY!Luận văn: Hoàn thiện công tác huy động vốn dân cư tại ngân hàng, HAY!
Luận văn: Hoàn thiện công tác huy động vốn dân cư tại ngân hàng, HAY!
 
Đề tài: Hoạch định chiến lược tại công ty cổ phần cảng Nam Hải
Đề tài: Hoạch định chiến lược tại công ty cổ phần cảng Nam HảiĐề tài: Hoạch định chiến lược tại công ty cổ phần cảng Nam Hải
Đề tài: Hoạch định chiến lược tại công ty cổ phần cảng Nam Hải
 
Đề tài: Bảng cân đối kế toán tại Công ty sản xuất sắt thép, HAY
Đề tài: Bảng cân đối kế toán tại Công ty sản xuất sắt thép, HAYĐề tài: Bảng cân đối kế toán tại Công ty sản xuất sắt thép, HAY
Đề tài: Bảng cân đối kế toán tại Công ty sản xuất sắt thép, HAY
 

Viewers also liked

Stepping Up To The Next Level 2010
Stepping Up To The Next Level 2010Stepping Up To The Next Level 2010
Stepping Up To The Next Level 2010tscheschlok
 
Actividad 2 cosme_diego
Actividad 2 cosme_diegoActividad 2 cosme_diego
Actividad 2 cosme_diegoDiego cosme
 
Hsse komitmen latipah fahrun
Hsse komitmen latipah fahrunHsse komitmen latipah fahrun
Hsse komitmen latipah fahrunArya Sejahtera
 
2039884359 apresentação aula-01
2039884359 apresentação aula-012039884359 apresentação aula-01
2039884359 apresentação aula-01Naiara Gomes
 
Teaching Algoritms Using Visual Basic (Hungarian)
Teaching Algoritms Using Visual Basic (Hungarian)Teaching Algoritms Using Visual Basic (Hungarian)
Teaching Algoritms Using Visual Basic (Hungarian)Beregszászi István
 
Produk 2015 Terasik
Produk 2015 TerasikProduk 2015 Terasik
Produk 2015 Terasikipan3rut
 
Mobile Security Blanco/Ueda
Mobile Security Blanco/UedaMobile Security Blanco/Ueda
Mobile Security Blanco/UedaFernando Blanco
 
Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...
Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...
Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...iosrjce
 
Chapter 14 lesson 2
Chapter 14 lesson 2Chapter 14 lesson 2
Chapter 14 lesson 2rmckinnon1
 
To be , have got and can
To be , have got and canTo be , have got and can
To be , have got and canVanina1234
 
Wh leaping frogs game
Wh   leaping frogs gameWh   leaping frogs game
Wh leaping frogs gameVanina1234
 
60314 comparatives and_superlatives solar system
60314 comparatives and_superlatives solar system60314 comparatives and_superlatives solar system
60314 comparatives and_superlatives solar systemVanina1234
 
معرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژاد
معرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژادمعرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژاد
معرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژادirpycon
 
Nuevas tecnologías
Nuevas tecnologíasNuevas tecnologías
Nuevas tecnologíasMilla9305
 
Detailed Lesson Plan (5A's)
Detailed Lesson Plan (5A's)Detailed Lesson Plan (5A's)
Detailed Lesson Plan (5A's)EMT
 

Viewers also liked (20)

Stepping Up To The Next Level 2010
Stepping Up To The Next Level 2010Stepping Up To The Next Level 2010
Stepping Up To The Next Level 2010
 
Actividad 2 cosme_diego
Actividad 2 cosme_diegoActividad 2 cosme_diego
Actividad 2 cosme_diego
 
Hsse komitmen latipah fahrun
Hsse komitmen latipah fahrunHsse komitmen latipah fahrun
Hsse komitmen latipah fahrun
 
2039884359 apresentação aula-01
2039884359 apresentação aula-012039884359 apresentação aula-01
2039884359 apresentação aula-01
 
Teaching Algoritms Using Visual Basic (Hungarian)
Teaching Algoritms Using Visual Basic (Hungarian)Teaching Algoritms Using Visual Basic (Hungarian)
Teaching Algoritms Using Visual Basic (Hungarian)
 
Gizmo brochure
Gizmo brochureGizmo brochure
Gizmo brochure
 
Produk 2015 Terasik
Produk 2015 TerasikProduk 2015 Terasik
Produk 2015 Terasik
 
Mobile Security Blanco/Ueda
Mobile Security Blanco/UedaMobile Security Blanco/Ueda
Mobile Security Blanco/Ueda
 
Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...
Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...
Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...
 
Chapter 14 lesson 2
Chapter 14 lesson 2Chapter 14 lesson 2
Chapter 14 lesson 2
 
Marketing 3.0
Marketing 3.0Marketing 3.0
Marketing 3.0
 
To be , have got and can
To be , have got and canTo be , have got and can
To be , have got and can
 
Wh leaping frogs game
Wh   leaping frogs gameWh   leaping frogs game
Wh leaping frogs game
 
60314 comparatives and_superlatives solar system
60314 comparatives and_superlatives solar system60314 comparatives and_superlatives solar system
60314 comparatives and_superlatives solar system
 
Ppt ips indonesia
Ppt ips indonesiaPpt ips indonesia
Ppt ips indonesia
 
معرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژاد
معرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژادمعرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژاد
معرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژاد
 
Emerging Picture of Value Based Pricing
Emerging Picture of Value Based PricingEmerging Picture of Value Based Pricing
Emerging Picture of Value Based Pricing
 
Nuevas tecnologías
Nuevas tecnologíasNuevas tecnologías
Nuevas tecnologías
 
Geological Site Investigation Methods
Geological Site Investigation MethodsGeological Site Investigation Methods
Geological Site Investigation Methods
 
Detailed Lesson Plan (5A's)
Detailed Lesson Plan (5A's)Detailed Lesson Plan (5A's)
Detailed Lesson Plan (5A's)
 

Similar to Master Thesis

An Introduction To Text-To-Speech Synthesis
An Introduction To Text-To-Speech SynthesisAn Introduction To Text-To-Speech Synthesis
An Introduction To Text-To-Speech SynthesisClaudia Acosta
 
Modelling Time in Computation (Dynamic Systems)
Modelling Time in Computation (Dynamic Systems)Modelling Time in Computation (Dynamic Systems)
Modelling Time in Computation (Dynamic Systems)M Reza Rahmati
 
optimization and preparation processes.pdf
optimization and preparation processes.pdfoptimization and preparation processes.pdf
4 Morpho-syntactic Annotation Method
  4.1 Introduction
  4.2 Our Main Method
  4.3 Preliminary Step
  4.4 Word Analysis
  4.5 Choosing Results
  4.6 Result File Generation
  4.7 Conclusion
5 Realization and Performance Evaluation
  5.1 Introduction
  5.2 Tools and Resources
    5.2.1 Al-Khalil Analyzer
    5.2.2 MADA Analyzer
    5.2.3 Al-Khalil TD Version
    5.2.4 Tunisian Dialect Dictionary
  5.3 Tunisian Dialect Annotation Tool
    5.3.1 Process of TDAT
    5.3.2 Realization
    5.3.3 TDAT Interfaces
  5.4 Evaluation
    5.4.1 Gold Standard for the TD Language
    5.4.2 Evaluating the TDAT
  5.5 Conclusion
Conclusion
A TD Enriched Orthographic Transcription
  A.1 Inter-pausal Units Segmentation
  A.2 Conventions
    A.2.1 Typographic Rules
    A.2.2 Pronunciation Notation
    A.2.3 Liaisons
    A.2.4 Non-Arabic Phonemes
    A.2.5 Reported Speech
    A.2.6 Incomprehensible Sequences
    A.2.7 Laughters
    A.2.8 Pauses
B LDC Commitment
Bibliography
List of Figures

3.1 The proportion of themes in the corpus
3.2 Sections and Turns in Transcriber
3.3 Manage speakers in Transcriber
3.4 Manage speakers in Praat
4.1 Morpho-syntactic annotation steps in our method
4.2 Example of the Segmented Text
4.3 Analyzing process of a TD word
4.4 Analyzing a word with the TD dictionary
5.1 XML Schema of the Annotated Text
5.2 Main Interface in the TDAT
5.3 Transcription window in the TDAT
5.4 Segmented Text window in the TDAT
5.5 Add an analysis result window in the TDAT
5.6 Analyze window in the TDAT
5.7 Analyse Details window in the TDAT
5.8 Annotation options window in the TDAT
5.9 Analysis Result File window in the TDAT
List of Tables

3.1 Corpus files content
3.2 Description of tags used in the references XML file
3.3 Clitics in the TD language
4.1 Annotations extracted from the transcription files
4.2 Description of tags used in the segmented text file
4.3 Levenshtein distance table example
4.4 Description of tags used in the annotation result file
5.1 Description of icons used in the annotation interface
5.2 Evaluation results of the word segmenter module
5.3 Evaluation results of the TD dictionary module using the Levenshtein distance function
5.4 Evaluation results of the TDAT
List of Abbreviations

AMADAT  Arabic Multi-Dialectal Transcription Tool
AOC     Arabic Online Commentary Dataset
CES     Corpus Encoding Standard
DTD     Document Type Definitions
ECA     CALLHOME Egyptian Arabic Speech
ELAN    EUDICO Linguistic Annotator
FBIS    Foreign Broadcast Information Service
HMM     Hidden Markov Model
HPL     Hewlett-Packard Laboratories
ICA     Iraqi Colloquial Arabic
LA      Levantine Arabic
LATB    Levantine Arabic Treebank
LDC     Linguistic Data Consortium
MSA     Modern Standard Arabic
NLP     Natural Language Processing
OSAC    Open Source Arabic Corpora (updated)
OSAc    Open Source Arabic Corpus
POS     Part of Speech
POST    Part-of-Speech Tagging
SAAVB   Saudi Accented Arabic Voice Bank
SAMPA   Speech Assessment Methods Phonetic Alphabet
STT     Speech-to-Text
TD      Tunisian Dialect
TDAT    Tunisian Dialect Annotation Tool
TEI     Text Encoding Initiative
XML     Extensible Markup Language
XSL     eXtensible Stylesheet Language
Introduction

The Arabic language is spoken by about 300 million people [Al-Shamsi 2006] and is the fourth most spoken language in the world; it is therefore a major international modern language. Given the number of its speakers, however, the computing resources available for Arabic remain few.

Arabic is a blend of Modern Standard Arabic, used in written and formal spoken discourse, and a collection of related Arabic dialects. This mixture was described by Hymes [Hymes 1973] as a linguistic continuum. Indeed, Arabic dialects present significant phonological, morphological, lexical, and syntactic differences among themselves and when compared to the standard written form. Furthermore, the presence of diglossia [Ferguson 1959] is a real challenge for Arabic speech language technologies, including corpus creation to support Speech-to-Text (STT) systems. Additional difficulties arise because the Arabic dialects are morphologically complex, and because only a small amount of text data exists for spoken Arabic, which has no official written form.

A better set of corpora would support further research in this area, for example by helping linguists analyze the phenomena of Arabic dialects. It would also lay the ground for creating new and better end-user applications. One of the fundamental prerequisites of any Natural Language Processing (NLP) application for a specific language, such as the Tunisian Dialect (TD), is the existence of corpora. The construction of speech corpora for the TD is thus essential both for studying its specificities and for advancing its NLP applications, such as speech recognition. A few small corpora already exist, developed in previous research. However, these corpora mix the TD with other dialects, are not specific to the TD, and do not include any diacritic information. In general, they are either part of closed projects or not freely available, and they contain no morpho-syntactic annotation or phonetic information.

The aim of this project is to investigate how to collect and transcribe speech data, whether existing transcription tools can be used, and which guidelines are appropriate. It also asks how the transcripts should be annotated and which methods can be used to do so.

The report is divided into two parts. The first part presents the state of the art of existing speech corpus resources for the Arabic language; its second chapter lists some morpho-syntactic annotation methods. The second part describes the method and resources used to collect, transcribe, and annotate the speech data. The third chapter presents the steps we followed to collect and transcribe the speech data. In the fourth chapter, we present our method for the morpho-syntactic annotation task. The last chapter presents the tools and resources used, the developed tool, and the obtained results. Finally, a conclusion summarizes the results of our work and presents some future prospects.
Chapter 1
Linguistic Resources

1.1 Introduction

A spoken language corpus is defined as "a collection of speech recordings which is accessible in computer readable form and which comes with annotation and documentation sufficient to allow re-use of the data in-house, or by scientists in other organizations" [Gibbon 1997]. Creating speech corpora is crucial for studying the different characteristics of a spoken language, as well as for developing applications that deal with the voice, such as speech recognition systems.

In this chapter, we answer the following question: how is a speech corpus created? We first present methods of speech and text data collection. Then, we focus on some available corpora for the Arabic language. After that, we introduce the orthographic transcription task by reviewing its guidelines and tools.

1.2 Speech and Text Data Collection

A prerequisite for the successful development of spoken language resources is a good definition of the speech data to be collected. Text data collection involves three steps. The first is to specify the source of the data (books, novels, chat rooms, etc.), its type (standard written language or dialectal), theme (social, news, sport, etc.), encoding, and format (transcribed files, web pages, XML files, etc.). The second step is the collection itself, performed with different techniques such as harvesting large amounts of data from the web [Diab 2010] or transcribing speech data [Messaoudi 2004]; an automatic speech recognition method, as used in [Messaoudi 2004, Gauvain 2000], can also be applied to extract text from speech data. The third step consists in adapting and organizing these data [Diab 2010].

Speech data collection follows the same steps as text collection. In addition, speech data of different types (audio or video) and formats (mp3, wave, avi, etc.) can be collected in different ways. The easiest is to download streaming videos and audio from the Internet; unfortunately, this method cannot guarantee good data quality. Otherwise, we must resort to recording, where we can fix the subjects and the speakers' dialects as we wish and ensure the best data quality. However, recording requires funding to pay speakers and to buy specific equipment. Some annotation tools [Kipp 2011] allow direct access to broadcast data through the associated Uniform Resource Locator (URL), which removes the collection step; regrettably, this feature is currently not available in the voice annotation tools we used.

1.3 Arabic Corpora

The Arabic language is composed of a standard written language (Modern Standard Arabic) and spoken dialects. The Arabic dialects are used extensively in almost all everyday conversations and are therefore of considerable importance. However, owing to the lack of data and resources, Natural Language Processing (NLP) technology for dialectal Arabic is still in its infancy: basic resources, tokenizers, and morphological analyzers, which have been developed for Modern Standard Arabic (MSA), are still virtually non-existent for the dialects.

1.3.1 Modern Standard Arabic Corpora

Many research projects have contributed to the development of MSA corpora. The updated version of the Open Source Arabic Corpora (OSAC) described in [Saad 2010] includes the British Broadcasting Corporation (BBC) Arabic corpus collected from bbcarabic.com, the Cable News Network (CNN) Arabic corpus collected from cnnarabic.com, and the Open Source Arabic Corpus (OSAc) collected from multiple sites. The OSAC corpus contains about 23 MB of text after removing stop words.

The Foreign Broadcast Information Service (FBIS) corpus is another MSA corpus, created by [Messaoudi 2005] and used in [Vergyri 2004]. The data set comprises a collection of radio newscasts from various radio stations in the Arabic-speaking world (Cairo, Damascus, Baghdad), totalling approximately 40 hours of speech and roughly 240K words. The transcription of the FBIS corpus was done in Arabic script only and does not contain any diacritic information.
The Linguistic Data Consortium (LDC) provides the Penn Arabic Treebank [Duh 2005], a data set of newswire text from Agence France Presse, An Nahar News, and Ummah Press transcribed in standard MSA script. It contains more than 113,500 tokens, analyzed and provided with disambiguated morphological information.

1.3.2 Dialectal Corpora

At present, the major standard dialect corpora are available through the LDC under the DARPA EARS (Effective, Affordable, Reusable Speech-to-Text) program, which develops robust speech recognition technology for a range of languages and speaking styles and includes data from the Egyptian, Levantine, Gulf, and Iraqi dialects. The LDC also provides conversational and broadcast speech together with their transcripts.

Levantine Arabic (LA) is represented by the Levantine Arabic QT Training Data Set [Maamouri 2006a], a set of telephone conversations between Levantine Arabic speakers [Duh 2005]. The data set contains approximately 250 hours of telephone conversations. About 2,000 successful calls have been collected, distributed across regional dialects (Levantine, Egyptian, Gulf, Iraqi, Moroccan, Saudi, Yemeni). LA provides both the conversational speech and its transcripts.

Another Arabic colloquial corpus, CALLHOME Egyptian Arabic Speech (ECA), used in [Duh 2005, Gibbon 1983], is dedicated to the Egyptian dialect. The data set consists of 120 telephone conversations between native speakers of Egyptian Arabic. The ECA corpus contains both dialectal and MSA word forms, and it is accompanied by a lexicon giving the morphological analysis of all words in terms of stem and morphological characteristics such as person, number, gender, POS, etc.

Finally, the Saudi Arabic dialect is represented by the Saudi Accented Arabic Voice Bank (SAAVB) [Alghamdi 2008], which is very rich in terms of its speech sound content and speaker diversity within Saudi Arabia. The total recorded speech lasts 96.37 hours, distributed among 60,947 audio files. SAAVB was externally validated and used by the IBM Egypt branch to train their speech recognition engine.
The English-Iraqi corpus is another Arabic corpus, mentioned in [Precoda 2007]; it consists of 40 hours of transcribed speech audio from DARPA's Transtac program.

Many small corpora have been developed to satisfy specific needs, as described in [Al-Onaizan 1999, Ghazali 2002, Barkat-Defradas 2003], where ten speakers originally from Egypt, Syria, Lebanon and Jordan and from the Moroccan Arabic area (Algeria and Morocco) listened to the story of the North Wind and the Sun in French and translated it spontaneously into their dialects.

Other research projects are limited to the collection of dialectal text data. The Arabic Online Commentary Dataset (AOC), described in [Zaidan 2011], was created by crawling the websites of three Arabic newspapers (Al-Ghad, Al-Riyadh, Al-Youm Al-Sabe). The commentary data consist of 52.1M words and include sentences from articles in the 150K crawled web pages; 41% of the content contains dialectal words. In addition to the AOC, an Arabic-Dialect/English Parallel Text was developed by Raytheon Bolt Beranek and Newman (BBN) Technologies, the LDC, and Sakhr Software. This corpus contains approximately 3.5 million tokens from Arabic dialect sentences with their English translations. The data consist of Arabic web text filtered automatically from a large Arabic text corpus provided by the LDC.

1.4 Orthographic Transcription

The acoustic signal of audio content may correspond to speech, music or noise, but also to mixtures of speech, music and noise. In addition, a single recording may contain a variety of speakers and topics; transcribers can work on a given subject successively or simultaneously, and the sound quality (fidelity) of the recording may vary significantly over time.

The different stages of transcription work are the segmentation of the soundtrack, the identification of turns and speakers, the orthographic transcription itself, and verification. Depending on the transcriber's choices, these steps can be conducted in parallel or sequentially over long portions of the signal.

The difficulty of transcription depends on the number of speakers involved in the recordings and on the clarity of their pronunciation. Processing many files in quick succession does not make the work faster, as exhaustion slows down the process; it is preferable to take a rest between files.
In [Al-Sulaiti 2004], the average time needed by a non-professional typist to transcribe a five-minute Arabic spoken recording, without any enriched orthographic annotation, is 1:50:42 (the average of the shortest and the longest transcription times).

The annotation step aims to structure the recording, that is, to segment it and to describe the acoustic signal at the different levels deemed relevant for further processing. A transcription can reflect neither the audio recording perfectly nor every pronunciation of a given subject or term, and it can itself become the object of deeper studies of semantics, syntax, or pronunciation. Manual transcription of audio recordings such as radio or television streams advances research in automatic transcription, indexing, and archiving: the transcripts provide the linguistic resources and data that make it possible to build an automatic recognition system, which can then be used to produce automatic transcriptions.

1.4.1 Transcription Software

There are different types of tools for labelling and annotating speech corpora. Some target audio formats, such as Transcriber [Barras 2000], Praat [D.Weenink 2013], SoundIndex, and AMADAT, and others target video formats, for example Anvil [Kipp 2011] and the EUDICO Linguistic Annotator (ELAN) [Dreuw 2008].

- Transcriber is free software that has been used in many projects such as [Messaoudi 2005, Piu 2007, Fromont 2012]; it has become very popular thanks to its simplicity and efficiency, since it makes transcribing and labelling easy.
- Praat is a productivity tool for phoneticians. It allows speech analysis, synthesis, labelling and segmentation, speech manipulation, statistics, and learning algorithms; the manipulation package includes statistical tools and produces publication-quality graphics.
- SoundIndex is a tool that lets the user attach audio tags at any level of the hierarchy of an XML file by setting attribute values such as the start and the end of the audio segment in the sound editor. The interpretation of the audio tags is written in XSL. (Software documentation: http://michel.jacobson.free.fr/soundIndex/Sommaire.htm)
- The Arabic Multi-Dialectal Transcription Tool (AMADAT) supports speech transcription and provides a very helpful correction level. (User guidelines: http://projects.ldc.upenn.edu/EARS/Arabic/EARS_AMADAT.htm)
- Anvil is a free video annotation tool. It offers frame-accurate, hierarchical, multi-layered annotation driven by user-defined annotation schemes. Colour-coded elements annotated on multiple tracks are displayed time-aligned on the annotation board. Special features include cross-level links, non-temporal objects, and a project tool for managing multiple annotations. Anvil can import data from the widely used public-domain phonetic tools Praat and XWaves, and its data files are XML-based.
- ELAN is an annotation tool for creating, editing, visualizing, and searching annotations for video and audio data. This software aims to provide a sound technological basis for the annotation and exploitation of multimedia recordings. ELAN is specifically designed for the analysis of language, sign language, and gesture, yet it can be used on any media corpora, with video and/or audio data, for purposes of annotation, analysis, and documentation.

1.4.2 Transcription Guidelines

The transcription process follows specific conventions so as to produce records that are structured by thematic content, speakers, and other speech information; the information produced is called annotation. Many conventions have been defined in NLP projects to satisfy the need for a homogeneous transcription style and to provide annotation enrichments. Generally, these conventions depend on the speech data format and on the transcription tool used.

When a speech corpus is transcribed into written text, the transcriber is immediately confronted with the following question: how should the reality of oral speech be reflected in the corpus? Sets of rules for writing speech corpora are designed to provide an enriched orthographic transcription; these conventions establish which phenomena are annotated. Numerous studies have dealt with prepared speech, for example broadcast news [Cam 2008]. Conversational speech, however, is a more informal activity in which participants constantly manage the topic and in which speakers and speech turns, corresponding to changes of speaker, must be identified [Gro 2007, Cam 2008, André 2008]. As a consequence, numerous phenomena appear, such as hesitations, repeats, feedback, backchannels, etc.
Other phonetic phenomena such as non-standard elisions, reduction phenomena [Meunier 2011], truncated words and, more generally, non-standard pronunciations are also very frequent. All these phenomena can affect phonetization. Hence, identifying different types of pauses, long pauses between turn-takings and short pauses between words, is very useful for further purposes such as the development of a voice recognition system [Alotaibi 2010]. In [Gro 2007], the conventions focus on segment structures marked by elongation, truncation, aspiration, and sighs.

Spontaneous oral production is a real problem in terms of annotation because, according to [Shriberg 1994]: "Disfluencies show regularities in a variety of dimensions. These regularities can help guide and constrain models of spoken language production. In addition they can be modeled in applications to improve the automatic processing of spontaneous speech."

Another definition of disfluencies is given in [Piu 2007]: "Disfluencies (repeats, word-fragments, self-repairs, aborted constructs, etc.) inherent in any spontaneous speech production constitute a real difficulty in terms of annotation." Indeed, the annotation of these phenomena does not seem easily automatizable, because their study requires an interpretative judgement. There are different types of disfluencies, as described in [Piu 2007]:

- Repetition disfluency is among the most frequent types in conversational speech (accounting for over 20% of disfluencies). According to [Cole 2005]: "Repetition disfluencies occur when the speaker makes a premature commitment to the production of a constituent, perhaps as a strategy for holding the floor, and then hesitates while the appropriate phonetic plan is formed."
- Self-correction, as described in [Kurdi 2003], is the substitution of a word or series of words by others, to modify or correct a part of the statement.
- [Pallaud 2002] highlights word fragments (primers) as morphemes whose enunciation is interrupted.

Generally, disfluencies can combine at least two of the phenomena mentioned above simultaneously.
In [Dipper 2009]: "Transcription guidelines specify how to transcribe letters that do not have a modern equivalent. They also specify which letter forms represent variants of one and the same character, and which letters are to be transcribed as different characters."

In [Cam 2008], numerous transcription rules relate to the speech text, such as how to write letters, punctuation, numbers, Internet addresses, acronyms, spelling, abbreviations, hesitations, repetitions, truncations, and absent or unknown words. A list of markups is also used to identify noise, pronunciation problems, backchannels, and comments. In [André 2008], specific pronunciations were recorded with the SAMPA phonetic alphabet. General rules for transcribing short vowels, spelling, and diacritics are presented in the LDC Guidelines for Transcribing Levantine Arabic (available on the LDC website: http://ldc.upenn.edu/Projects/EARS/Arabic/www.Guidelines_Levantine_MSA.htm).

1.5 Conclusion

Despite all the attempts made by the LDC and other research projects to provide speech corpora for Arabic dialects, some varieties, such as the Tunisian Dialect (TD), still need more work on corpus construction. These attempts have faced several challenges, some of them related to the Arabic language and to general NLP issues. Moreover, some problems become noticeable during the speech transcription process, such as the ambiguity of a word's transcription. Another problem occurs when multiple sound sources are present: it is then necessary to focus the transcription on the most prominent source. Likewise, when two speakers talk in the foreground, both can be transcribed through the mechanism of superimposed speech.
Chapter 2
Morpho-Syntactic Annotation

2.1 Introduction

A transcript can be annotated by adding linguistic information to each word. A definition of linguistic annotation is given in [Sawaf 2010]: "Corpus annotation is the practice of adding interpretative, especially linguistic, information to a text corpus, by coding added to the electronic representation of the text itself." Grammatical tagging is the task of associating a label, or tag, with each word in the text to indicate its grammatical class. Generally speaking, the morpho-syntactic annotation process relies on word structure, as described in [Al-Taani 2009]: patterns and affixes are used to determine the grammatical class of a word according to defined rules. The process also differs according to the data and resources used; the approaches employed therefore range from supervised to unsupervised, as described in [Jurafsky 2008]. In the following sections we introduce morpho-syntactic annotation methods for the Modern Standard Arabic (MSA) language and for dialectal Arabic.

2.2 Morpho-syntactic Annotation Methods for the MSA Language

Morpho-syntactic annotation of MSA has been performed with different approaches, such as the statistical approach of [Al-Shamsi 2006] and the learning approach of [Bosch 2005]. More recently, some works have combined approaches to improve tagger performance. Some of these works are described below.

The statistical approach was used in [Al-Shamsi 2006] to handle part-of-speech tagging (POST) of Arabic text. The developed method is based on a Hidden Markov Model (HMM) and follows these steps:

1. creation of a set of tags;
2. use of Buckwalter's stemmer to stem the Arabic text of the corpus, which contains 9.15 MB of native Arabic articles;
3. manual correction of tagging errors;
4. design and construction of an HMM-based model of Arabic POS tags;
5. training of the developed tagger on the annotated corpus.

The proposed method achieved an F-measure of 97%.

Another Arabic POS tagging system was proposed by [Hadj 2009]. Its tagging is based on sentence structure and combines morphological analysis with an HMM. The morphological analysis aims to reduce the size of the tag lexicon by segmenting words into prefixes, stems, and suffixes. The HMM represents the sentence structure in order to take the logical linguistic sequencing into account: each possible state of the HMM corresponds to a tag, and the transitions between states are governed by the syntax of the sentence. The training corpus is composed of old texts extracted from books of the third century, manually tagged with the developed tagset. The system was evaluated on the same corpus and achieved a recognition rate of 96%, which is very promising given the size of the tagged data.

In addition to statistical approaches, recent applications tend to explore machine learning methods to handle Arabic morphology and POS tagging. A memory-based learning approach was developed by [Bosch 2005] for the morphological analysis and part-of-speech tagging of written Arabic. The classification task is performed with a k-nearest neighbor classifier. Memory-based learning, a supervised inductive learning algorithm, treats a set of labeled training instances as points in a multi-dimensional feature space and stores these instances as such in an instance base in memory. Furthermore, [Bosch 2005] employed a modified value difference metric as the distance function to determine the similarity between pairs of feature values; this metric uses the conditional probabilities of the two values given the classes. To train and test the approach, [Bosch 2005] used the Arabic Treebank 1 (version 2.0) corpus, which consists of 166,068 tagged words. The morphological analyzer was evaluated on predicting the part-of-speech tags of the segments, the positions of the segmentations, and all letter transformations between the surface form and the analysis; the obtained precision, recall, and F-score are 0.41, 0.43 and 0.42 respectively. The POS tagger attained an accuracy of 66.4% on unknown words and 91.5% on all words in held-out data. Combining the morpho-syntactic analyses generated by the morphological analyzer with the parts of speech predicted by the tagger yields a joint accuracy of 58.1%, i.e., the proportion of correctly predicted tags corresponding to the full analysis of unknown words. The main limitation of the memory-based learning approach, as concluded in [Bosch 2005], is its inability to recognize the stem of an unknown word and, accordingly, the appropriate vowel insertions.
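To make the memory-based idea concrete, the following minimal sketch stores each training token with a small context window and tags an unseen token by majority vote of its nearest stored neighbours. It is only an illustration of the general technique: the feature set, the overlap distance, the toy tagset and the transliterated example tokens are invented here and do not reproduce the system of [Bosch 2005].

```python
from collections import Counter

def features(tokens, i):
    """Toy context features for token i: previous word, last three
    characters of the word, and next word."""
    word = tokens[i]
    prev = tokens[i - 1] if i > 0 else "<s>"
    nxt = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return (prev, word[-3:], nxt)

class MemoryBasedTagger:
    """Store every training instance; tag new tokens by majority vote
    of the k nearest instances under a simple overlap distance."""

    def __init__(self, k=3):
        self.k = k
        self.instances = []          # list of (feature_tuple, tag)

    def train(self, tagged_sentences):
        for sentence in tagged_sentences:
            tokens = [w for w, _ in sentence]
            for i, (_, tag) in enumerate(sentence):
                self.instances.append((features(tokens, i), tag))

    def _distance(self, f1, f2):
        # Overlap distance: number of mismatching feature values.
        return sum(a != b for a, b in zip(f1, f2))

    def tag(self, tokens):
        result = []
        for i in range(len(tokens)):
            f = features(tokens, i)
            nearest = sorted(self.instances,
                             key=lambda inst: self._distance(f, inst[0]))[:self.k]
            votes = Counter(tag for _, tag in nearest)
            result.append((tokens[i], votes.most_common(1)[0][0]))
        return result

# Invented transliterated training data and test sentence.
train = [[("il-walad", "NOUN"), ("yiktib", "VERB"), ("ktaab", "NOUN")],
         [("il-bint", "NOUN"), ("tiqra", "VERB"), ("jariida", "NOUN")]]
tagger = MemoryBasedTagger(k=1)
tagger.train(train)
print(tagger.tag(["il-walad", "yiqra", "ktaab"]))
```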
Another approach, combining statistical and rule-based techniques, was introduced by [Khoja 2001] to construct an Arabic part-of-speech tagger. First, the approach relies on traditional Arabic grammatical theory to determine the rules used when stemming a word; these rules determine the stem or root by removing affixes (prefixes, suffixes and infixes). Second, the approach uses lexical and contextual probabilities: the lexical probability is the probability of a word having a certain grammatical class, whereas the contextual probability is the probability of one tag following another. Both are estimated from the tagged training corpus. The method consists in searching for a word in the lexicon to determine its possible tags; words not found in the lexicon are stemmed, using combinations of affixes to determine the tag. Finally, to disambiguate ambiguous and unknown words, [Khoja 2001] used a statistical tagger based on the Viterbi algorithm [Jelinek 1976].

To train the tagger and construct the lexicon, [Khoja 2001] used a manually tagged corpus of 50,000 Modern Standard Arabic words extracted from the Saudi Al-Jazirah newspaper; the constructed lexicon contains 9,986 words. To test the tagger, four corpora (85,159 words) were collected from newspapers and papers in social science; in addition to MSA words, the test corpus contained some colloquial words. The statistical tagger achieved an accuracy of around 90% when disambiguating ambiguous words. Furthermore, [Khoja 2001] used an Arabic dictionary (4,748 roots) to test the developed stemmer, which reached an accuracy of 97%. Since the unanalyzed words are generally foreign terms, proper nouns, or incorrectly written words, [Khoja 2001] concluded that adding a pre-processing component could solve the problem.
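The statistical core shared by the taggers of [Al-Shamsi 2006] and [Khoja 2001], an HMM decoded with the Viterbi algorithm, can be sketched in a few lines. The tagset, the hand-set probabilities and the transliterated example below are invented for illustration; they are not taken from the cited systems.

```python
import math

def viterbi(words, tags, trans, emit, start):
    """Bigram HMM POS tagging with Viterbi decoding.
    trans[t1][t2]: P(t2 | t1); emit[t][w]: P(w | t); start[t]: P(t at start).
    Unknown words and transitions get a small smoothing probability."""
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    V = [{t: (logp(start.get(t, 0)) + logp(emit[t].get(words[0], 1e-6)), None)
          for t in tags}]
    for w in words[1:]:
        column = {}
        for t in tags:
            best_prev, best_score = None, float("-inf")
            for p in tags:
                score = V[-1][p][0] + logp(trans[p].get(t, 1e-6))
                if score > best_score:
                    best_prev, best_score = p, score
            column[t] = (best_score + logp(emit[t].get(w, 1e-6)), best_prev)
        V.append(column)

    # Backtrack from the best final state.
    last = max(tags, key=lambda t: V[-1][t][0])
    path = [last]
    for column in reversed(V[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))

# Toy model over transliterated tokens (all probabilities are invented).
tags = ["NOUN", "VERB", "PART"]
start = {"NOUN": 0.5, "VERB": 0.3, "PART": 0.2}
trans = {"NOUN": {"VERB": 0.5, "NOUN": 0.3, "PART": 0.2},
         "VERB": {"NOUN": 0.7, "PART": 0.2, "VERB": 0.1},
         "PART": {"VERB": 0.6, "NOUN": 0.4}}
emit = {"NOUN": {"walad": 0.4, "ktaab": 0.4},
        "VERB": {"yiktib": 0.8},
        "PART": {"maa": 0.9}}
print(viterbi(["walad", "yiktib", "ktaab"], tags, trans, emit, start))
# expected: ['NOUN', 'VERB', 'NOUN']
```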
2.3 Morpho-syntactic Annotation Methods for Dialectal Arabic

[Maamouri 2006b] describes a supervised approach used to annotate dialectal Arabic data. A word list from the Levantine Arabic Treebank (LATB) data was used to manually annotate the most frequent surface forms; pattern-matching operations were then applied to identify potential new prefix-stem-suffix combinations among the remaining unannotated words in the list. The Morphological/Part-of-Speech/Gloss (MPG) tagging included morphological analysis, POS tagging, and glossing. The developed system was evaluated before and after the use of the dictionary; the evaluation shows a reduction of more than 10% in annotation errors.

Another supervised approach, in [Duh 2005], was designed for tagging dialectal Arabic using ECA data. The system is based on a statistical trigram tagger in the form of an HMM and a baseline POS tagger. Statistical modeling and cross-dialectal data-sharing techniques were used to enhance the performance of the baseline tagger. The adopted approach requires only raw text data from several varieties of Arabic and a morphological analyzer for MSA; no dialect-specific tools are used. To evaluate the system, [Duh 2005] compared the obtained results with those obtained when using:

- a supervised tagger trained on hand-annotated data;
- a state-of-the-art MSA tagger applied to Egyptian Arabic.

As a result, the ECA tagger shows a 10% improvement.

In addition to the supervised approaches, other projects tend toward unsupervised approaches, for example [Chiang 2006]. An Arabic dialect parser is described in [Chiang 2006], where three frameworks were constructed for leveraging MSA corpora in order to parse Levantine Arabic (LA). The process is based on knowledge about the lexical, morphological, and syntactic differences between MSA and LA. [Chiang 2006] evaluated three methods:

- sentence transduction, in which the LA sentence to be parsed is turned into an MSA sentence and then parsed with an MSA parser;
- treebank transduction, in which the MSA treebank is turned into an LA treebank;
- grammar transduction, in which an MSA grammar is turned into an LA grammar which is then used for parsing LA.

The MSA treebank data used comprise 17,617 sentences and 588,244 tokens and are combined with four different lexicons: a small lexicon with uniform probabilities, a small lexicon with EM-based probabilities, a big lexicon with uniform probabilities, and a big lexicon with EM-based probabilities. To evaluate the developed parser, [Chiang 2006] used data comprising 10% of the MSA treebanks plus 2,051 sentences and 10,644 tokens from the Levantine treebank (LATB). The major limitation of this method, as concluded in [Chiang 2006], is the absence of a demonstration of cost-effectiveness.
Another approach, described in [Al-Sabbagh 2012], uses a function-based annotation scheme in which words are annotated according to their grammatical functions; in this scheme, morpho-syntactic structures and grammatical functions may differ from each other. The method is based on an implementation of Brill's transformation-based POS tagging algorithm. The tagger was trained on a manually annotated Twitter-based Egyptian Arabic corpus composed of 22,834 tweets and 423,691 tokens, and it was evaluated with ten-fold cross-validation. The obtained F-measures are 87.6% for POS tagging without semantic feature labeling and 82.3% for POS tagging with tokenization and semantic features. A problem faced during analysis concerns three-letter and two-letter words, which are highly ambiguous and can have multiple readings depending on the short vowel pattern; for example, a word written with the bare consonants j-d can be analyzed by the tagger as a noun (meaning grandfather or seriousness) or as an adverb (meaning seriously). To solve this problem, [Al-Sabbagh 2012] concluded that a word sense disambiguation module is essential to improve performance on highly ambiguous words.

2.4 Conclusion

In this chapter, we introduced morpho-syntactic annotation methods for MSA and dialectal Arabic. The choice of approach for the morpho-syntactic annotation task, such as a statistical or a learning approach, depends on the available data and resources. Unsupervised techniques are not suitable for poorly resourced languages such as the dialects. Therefore, the POS tagging process for colloquial Arabic still needs more improvement in terms of corpus collection and annotation.
Chapter 3
Data Collection and Transcription

3.1 Introduction

The transcription process consists of two basic steps. The first provides the voice data that will later be transcribed. The second consists in transcribing the voice data according to the directives we have established; to allow a better representation of spontaneous speech phenomena, these directives take the specificities of TD transcription into consideration. The two steps are detailed in the following sections.

3.2 Speech Collection

The aim of this section is to describe the collection of speech data, the first step in corpus creation. The choice of speech data content and type is very important and conditions further uses of the corpus. We chose to provide both audio and video speech so as to support new research directions, especially video annotation [Kipp 2011]. Furthermore, including different Tunisian dialects (Sfaxian dialect, Sahel dialect, etc.) improves the representativeness of the TD in our corpus.

As the main source of speech data, we used broadcast conversational speech, as in the projects [Lamel 2007] and [Belgacem 2010]. These streams are generally radio and television talk shows, debates, and interactive programmes in which the general public is invited to participate in the discussion by telephone. In general, the common conversational dialect in Tunisia is the dialect of the capital, which is also used on national TV and radio stations and by the majority of educated people; consequently, we allocated the largest part of our corpus to this dialect.
Providing speech data with a variety of themes increases the size of the corpus vocabulary and is very useful for further applications such as theme classification [Bischo 2009]. We defined the following list of themes for our data selection: Religious, Political, Cooking, Health, and Social; the latter may include recordings that touch on more than one theme. We also defined an Other tag to cover the remaining types. Figure 3.1 shows the proportion of each theme in our corpus.

Figure 3.1: The proportion of themes in the corpus

Having a good amount of spoken recordings is fundamental to the design of the corpus. High sound quality is also required, and will be useful for future processing such as voice recognition. In addition, we included both single-speaker and multi-speaker recordings in order to capture different aspects of conversational speech. Table 3.1 describes each transcription file in our corpus; the transcribed files reach a total duration of 1 hour, 25 minutes and 37 seconds.

The collected files generally have a long duration, exceeding fifteen minutes. To simplify the transcription task, we split these recordings into sequences of between five and fifteen minutes, and then convert them to the MP3 format to match the input expected by the transcription software.
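The thesis does not name the software used to split and convert the recordings; the snippet below is one possible way to do it with the pydub library (an assumption, not the tool used in the project), cutting a long recording into chunks of at most fifteen minutes and exporting each chunk as MP3. The file names are hypothetical.

```python
from pydub import AudioSegment  # requires ffmpeg to be installed

def split_to_mp3(path, out_prefix, chunk_minutes=15):
    """Split a long recording into chunks of at most `chunk_minutes`
    and export each chunk as MP3 for the transcription software."""
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000
    for n, start in enumerate(range(0, len(audio), chunk_ms), start=1):
        chunk = audio[start:start + chunk_ms]
        chunk.export(f"{out_prefix}_{n:02d}.mp3", format="mp3")

# Hypothetical file name for one of the collected recordings.
split_to_mp3("emission_07.wav", "emission_07_part")
```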
Table 3.1: Corpus files content

File  Duration (sec)  Number of speakers  Size (MB)  Type
01    758             1                   12.1       Video
02    867             3                   13.9       Video
03    216             12                  19.1       Audio
04    232             13                  20.5       Audio
05    204             9                   18.0       Audio
06    360             8                   31.8       Audio
07    900             2                   14.4       Video
08    364             3                   5.8        Video
09    477             2                   7.6        Video
10    759             2                   12.2       Video

To document the speech data collection in more detail, we provide a references.xml file containing a description of each file in our corpus. XML is used to represent the data structures in this file; such a representation allows a simple preview for users and an easier integration into future annotation systems. In addition, we wrote a DTD file, named references.dtd, to validate the XML file. Table 3.2 describes the tags used in references.xml.

Table 3.2: Description of tags used in the references XML file

Label    Type     Name                       Unit      Description
ID       Integer  Identifier                           Identifier of the record
NAM      String   Name                                 Name of the record
DUR      Integer  Duration                   sec       Duration of the record
TOP      List     Topic                                Topic of the record
NBMASP   Integer  Number of male speakers              Number of male speakers in the record
NBFESP   Integer  Number of female speakers            Number of female speakers in the record
FISI     Float    File Size                  megabyte  Size of the record file
SOFITY   List     Source File Type                     Source file type of the record (TV or Radio)
SONAM    String   Source Name                          Source name of the record
SOFIEX   String   Source File Extension                Source file extension of the record
SOFISI   Float    Source File Size           megabyte  Source file size of the record
SOTY     List     Source Type                          Source type of the record (Audio or Video)
SODA     Date     Source Date                          Source date of the record
SOLI     String   Source Link                          Source link of the record
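The exact nesting of the elements in references.xml is not specified in the text, so the sketch below only illustrates how one corpus file could be described with the tag names of Table 3.2. The <references> and <record> wrappers and all the example values are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

def add_record(root, values):
    """Append one <record> element using the tag names of Table 3.2."""
    record = ET.SubElement(root, "record")
    for tag in ("ID", "NAM", "DUR", "TOP", "NBMASP", "NBFESP", "FISI",
                "SOFITY", "SONAM", "SOFIEX", "SOFISI", "SOTY", "SODA", "SOLI"):
        ET.SubElement(record, tag).text = str(values[tag])

root = ET.Element("references")
add_record(root, {
    "ID": 1, "NAM": "file_01", "DUR": 758, "TOP": "Social",
    "NBMASP": 1, "NBFESP": 0, "FISI": 12.1,
    "SOFITY": "TV", "SONAM": "example_channel", "SOFIEX": "mp4",
    "SOFISI": 95.0, "SOTY": "Video", "SODA": "2013-05-12",
    "SOLI": "http://example.org/stream",
})
ET.ElementTree(root).write("references.xml", encoding="utf-8",
                           xml_declaration=True)
```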
3.3 Transcription Process

The speech annotation process includes the segmentation of the soundtrack, the identification of turns and speakers, and the orthographic transcription. We applied these steps in parallel, taking into consideration the notes in the Orthographic Transcription of Tunisian Arabic [Zribi 2013] and in the Directive of Transcription and Annotation of Tunisian Dialects.

The first voice file (duration 12:38) was initially transcribed with the Speech Assessment Methods Phonetic Alphabet (SAMPA) for Arabic. We thought that using SAMPA for Arabic would give a better representation of the phonetics; however, during transcription we found it preferable to use Arabic script with diacritics, and we adopted Arabic script for the rest of the transcription process.

Transcribing Arabic spoken recordings is a very long task, especially when using Arabic script and diacritics. For example, a recording lasting 3 minutes 52 seconds took more than 4 hours. In practice, transcribing each minute of speech takes at least one hour: 15-20 minutes for the identification of turns and speakers and 40-45 minutes for the transcription itself, following the rules of our directives.
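Extrapolating from these rates gives a rough idea of the manual effort behind the corpus; the figure below is not reported in the thesis, it simply applies the stated rate of about one hour of work per minute of speech to the total corpus duration of 1 h 25 min 37 s.

```python
# Rough effort estimate derived from the rates stated above (not a figure from the thesis).
corpus_seconds = 1 * 3600 + 25 * 60 + 37       # total corpus: 1 h 25 min 37 s
minutes_of_speech = corpus_seconds / 60
hours_per_minute = 1.0                         # at least one hour of work per minute of speech
print(f"Estimated transcription effort: about {minutes_of_speech * hours_per_minute:.0f} hours")
# -> about 86 hours
```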
3.3.1 Transcription Tools

There is a variety of transcription tools for voice data (SoundIndex, AMADAT, XWaves, etc.). We selected Transcriber [Barras 2000] and Praat [D.Weenink 2013] for the transcription task. Transcriber was adopted for the following advantages:

Simple user interface:
- supports many languages (including English, French and Arabic);
- easy manipulation of the voice intervals;
- keyboard shortcuts for annotation.

Rich annotation:
- predefined annotation events (noise list, lexical list, named-entities list, etc.);
- possibility to edit or add further annotations.

Input file flexibility:
- accepts speech files of long duration;
- supports various file formats (au, wav, snd, mp3, ogg, etc.).

Better output representation:
- supports several encodings (UTF-8, ISO-8859-6, etc.);
- the output file follows a description schema.

Concerning Praat, the choice was related to the needs of our ANLP research group; in addition, Praat gives a better representation of speech overlaps and allows speech analysis.

3.3.2 Transcribing Guidelines

Transcription guidelines are meant to be followed during the annotation of our speech data. We elaborated the Orthographic Transcription of Tunisian Arabic directives [Zribi 2013], adapted from the Enriched Orthographic Transcription (TOE in French) [Bigi 2012]; to deal with the TD, some rules were modified or removed. Since a standard orthographic transcription does not take the observed phenomena of speech (elisions, disfluencies, liaisons, noise, etc.) into consideration, we enriched our directives with them. The Directive of Transcription and Annotation of Tunisian Dialects was written to provide additional phonemic and phonetic annotation of the speech data; examples were included to show how these rules are applied in the Transcriber software. This directive was adapted from the ESTER2 convention [Cam 2008] and takes the specificities of the Arabic language and of the TD into consideration. Some of the conventions are described below.

a) Identification of turns and speakers

- Sections: we define two kinds of sections in the audio document: relevant sections, identified by the title "report", and non-transcribed sections, identified by the title "nontrans". Non-transcribed sections last more than fifteen seconds and contain, for example:
  - advertising, weather reports, programme jingles,
  - applause,
  - music and songs,
  - the beginning or end of a show different from the current programme,
  - silence.
  The other sections are relevant, and they are the only ones that we segment and transcribe, as illustrated in Figure 3.2. The concept of sections is absent in Praat, so these rules are not applied when using it.

- Turn-taking: first, we identify each speaker involved in the audio document. There are two types of speakers:
Figure 3.2: Sections and Turns in Transcriber

  - global speakers, i.e., speakers common to several audio documents (presenter, journalists, etc.), identified by the syntax "First Last name";
  - local speakers, i.e., unknown speakers who intervene by telephone for example, identified by "First Last name" when possible, and otherwise by the syntax "speaker#n", where n is a number from 1 to n reflecting the order of the speakers.

  The same speaker must always appear under the same identifier, and the list of speakers must contain only speakers actually involved in the audio document. We also fill in all the information relative to these speakers, such as gender and dialect. Second, we attribute the name of the speaker to the speech turn; speech turns that contain no speaker speech are identified by the syntax "no speaker". The example in Figure 3.3 illustrates how speakers are managed in the Transcriber software.

  To solve the problem of overlapping speech, we adapted the solution mentioned in [Barras 2000]: we create a new speaker named with the syntax "First Last name speaker 1 + First Last name speaker 2". Praat gives a better representation of overlapping speech: the script of each speaker is presented separately in an individual interval, as shown in Figure 3.4.

Figure 3.3: Manage speakers in Transcriber

- Silence: silence can occur at the beginning of a speaker turn, mixed with the transcript, or at its end. We isolate silence and noise longer than 0.5 seconds in a "no speaker" speech turn. Silence longer than 0.2 seconds at the beginning of a speaker turn is isolated in a "no speaker" segment or integrated directly into the previous speech turn, and silence longer than 0.2 seconds at the end of a speaker segment is isolated in a "no speaker" segment or integrated into the upcoming "no speaker" segment. Finally, we add the hash symbol # when a silence of between 0.1 and 0.2 seconds occurs inside a relevant turn.

- Segments: a relevant segment contains an intervention of one speaker and must have a minimum of syntactic and semantic consistency. If a segment of a speech turn exceeds fifteen seconds, we redistribute it into several relevant segments.
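The silence rules above can be summarised as a small decision function. The thresholds come directly from the directives; the function name, the interval representation and the returned labels are ours and are only illustrative.

```python
def silence_action(duration, position):
    """Apply the silence rules of the directives to one silent interval.
    `position` is "start", "end", or "inside" (relative to a speaker turn).
    Only the thresholds come from the guidelines above."""
    if duration > 0.5:
        return "isolate in a 'no speaker' turn"
    if position in ("start", "end") and duration > 0.2:
        return "isolate in a 'no speaker' segment or merge with the adjacent turn"
    if position == "inside" and 0.1 <= duration <= 0.2:
        return "mark with # in the transcript"
    return "ignore"

# A few hypothetical silent intervals (duration in seconds, position in the turn).
for dur, pos in [(0.8, "inside"), (0.3, "start"), (0.15, "inside"), (0.05, "end")]:
    print(dur, pos, "->", silence_action(dur, pos))
```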
Figure 3.4: Manage speakers in Praat

b) Orthographic transcription

To transcribe the TD, we use the Modern Standard Arabic transcription rules that do not affect the characteristics of the dialect, and we define an additional set of rules, based on its phonology, for writing Tunisian Arabic words.

- Transcription of the Hamza: the Hamza is transcribed only if it is pronounced, using only one of its standard letter forms. If the absence of the Hamza in a dialectal word causes ambiguity, it should be transcribed.

- Transcription of ta marbuta: the ta marbuta should be written at the end of the word whether it is pronounced /a/ or /t/, as in the examples glossed "an apple" and "the child's book".

- Code switching: MSA, the TD and foreign languages coexist in the daily speech of Tunisian people. The transcription of MSA words should respect the transcription conventions of the Arabic language. Foreign words and MSA words are written respectively in the forms [lan:X, text or word, SAMPA pronunciation] and [lan:MSA, text or word], using the SAMPA of the corresponding language to note the speaker's pronunciation (a small parsing sketch for these tags is given after these conventions). Examples:
  - French word: [lan:Fr, informatique, ?anfurmati:ku]
  - English word: [lan:En, network, na:twirk]

- Atypical agreements: we keep the standard orthographic transcription of the words as they were said, as in the example glossed "the holidays are wonderful".

- Personal pronouns: pronouns must be transcribed using the fixed forms listed in the directives for "I", "you" (singular and plural), "we", "he", "she" and "they".

- Names of months and days: these must be transcribed as in the MSA language.

- Affixes and clitics: Table 3.3 lists the dialectal clitics. Note that the definite article should be written in full even when it is pronounced /l/; this rule also applies to words that start with a sun or moon letter.
  • 46. 3.3. Transcription Process 31 Table 3.3: Clitics in the TD language Clitic Enclitic Pronominal enclitic ¼, è, ð, eë, Ñë, Õ», e © u, ø Negation enclitic € Interrogation enclitic ú æ … Proclitic ð, È, r F , ¼, ¨, È d, Ð We used these representations for annotating t % … , PERSON NAME , ½Ó , Place name. Examples: t % … , ú q F m F Ì 9d ˆ eÔ « - Characters: The following phonemes /v/, /g/ and /p/ does'nt exist in the Arabic language. To transcribe them we add ' after these letters. - Incorrect word: When the speaker replaces a letter with an incorrect one, we keep the original letter and we add to it the corresponding correct one. We have to represent these updates, as the following example: x{Correct letter, Original letter}x, x is a letter of the word. c) Rules of marking Transcribe what is heard (hesitation, repetition, onomatopoeia, etc). The transcript should be close to the signal. - Noise: We insert the tag [i] to indicate breathing-inspiration or [e] to indicate breathing- expiration of the speaker. Insert tag [b] to indicate a noise: • Mouth noises (cough, throat noise, laughters, kisses, whisper, etc), • Rustling of papers, • Microphone noise. Insert tag [musique] to indicate music.
- Punctuation: Punctuate the text using only these punctuation marks: ., !, ?
3.4 Conclusion
A mixture of TV and radio station programs was collected and adapted through the speech data collection process described in this chapter. As a result, 10 files totalling more than 1 hour and 25 minutes were transcribed following our transcription guidelines. The transcription process was the most laborious stage of the project. During this process, we faced some problems with the transcription tools, such as the slowness of the Praat interface when transcribing long audio recordings and the improper updating of the Arabic script transcription in the Transcriber interface.
Chapter 4
Morpho-syntactic Annotation Method
4.1 Introduction
Speech and text resources for the TD are very rare, which is an obstacle to developing applications in this field. In this context, our project contributes to providing such resources by constructing a morpho-syntactically annotated speech corpus for the TD. In this chapter, we focus on the different phases of the annotation task, whose aim is to identify the grammatical class of each word. Our method consists in integrating different tools and resources to annotate TD words. We chose two morphological analyzers (MADA and Al-Khalil) to analyze MSA words, and we also use the Al-Khalil TD version and a TD dictionary to analyze TD words. We first present a global view of our method and the different steps we followed to achieve the morpho-syntactic annotation task. Then, we describe each step of this process in detail.
4.2 Our Main Method
In our method, the morpho-syntactic annotation process follows the steps described in Figure 4.1. The process starts by extracting the speakers' text and some useful word annotations (word language, named entities, etc.) from the transcription file. This information is then saved in another file with a specific structure to be used in the next, analysis step. Next, we use two morphological analyzers (MADA and the Al-Khalil TD version) and a dictionary, selected according to the characteristics of each word (language, onomatopoeia, etc.), and apply rules that we have established to decide the most suitable grammatical class for each word. Finally, after confirmation of the analysis by the user, we save the result of the morpho-syntactic annotation in an XML file with a specific structure. As a result, each word is assigned a tag that indicates its grammatical class. More details about these steps are given in the next sections.
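To make this routing concrete, the sketch below shows how a word could be dispatched either to a fixed tag or to the analyzer chain matching its language. The class and method names (WordRouter, WordInfo, annotate, ...) are our own assumptions for illustration, not the actual TDAT code.

```java
// Illustrative sketch of the routing step of Section 4.2; names are assumptions.
public class WordRouter {

    enum Lang { TD, MSA, FR, EN }

    static class WordInfo {
        String surface; Lang lang; boolean onomatopoeia; boolean namedEntity;
        WordInfo(String surface, Lang lang, boolean ono, boolean ne) {
            this.surface = surface; this.lang = lang;
            this.onomatopoeia = ono; this.namedEntity = ne;
        }
    }

    // Words with special characteristics get a fixed tag; the others are sent
    // to the analyzer chain that matches their language (Sections 4.4 a and b).
    String annotate(WordInfo w) {
        if (w.onomatopoeia) return "Onomatopoeia";
        if (w.namedEntity)  return "Named entity";
        if (w.lang == Lang.FR || w.lang == Lang.EN) return "Not-Recognized";
        return (w.lang == Lang.TD) ? analyzeTd(w.surface) : analyzeMsa(w.surface);
    }

    String analyzeTd(String word)  { return "unknown"; } // dictionary -> Al-Khalil TD -> MADA
    String analyzeMsa(String word) { return "unknown"; } // MADA -> Al-Khalil
}
```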
  • 49. 34 Chapter 4. Morpho-syntactic Annotation Method Transcription File Preliminary Step Segmented File Word Analysis Choosing Result Result File Generation Annotated File Tunisian Dialect Dictionary MADA Analyzer Al-Khalil TD version analyzer Figure 4.1: Morpho-syntactic annotation steps in our method 4.3 Preliminary Step The purpose of the preliminary step is to import speaker information and their speech text with all useful annotation (named entities, word language, ..). Indeed, these annotations vary depending on the used transcription tool (Transcriber or Praat). In the Table 4.1, we list the information extracted from annotation in the transcrip- tions les. The process of information extraction is summarized in the following steps. 1. Collect speech text: Speech text for each speaker is divided into many Speech Turns, so we gather them in a unique text for each speaker. 2. Speech text cleaning up: Speech text includes many annotations; some of anno- tations are very useful in the morpho-syntactic annotation process. Some other annotations such as noise and music are removed because they are not useful in this task.
  • 50. 4.3. Preliminary Step 35 3. Split speech text to sentences: Using punctuation annotation (` !',`.',` ?') we divide speech text into a list of sentences. 4. Extract words annotations: Words annotations (Pronunciation notation, Disuen- cies, Named entities, Word language) described in the Table 4.1 are extracted from the transcription le then will be used in the morpho-syntactic analysis process. 5. Generate Segmented Text le: After extracting useful annotation from the tran- scription le, we generate a structured le to be used in the next step. In Table 4.2 we present a description of each tag we have used in the segmentation text le. Table 4.2: Description of tags used in the segmented text le Tag Name Description Text Text The text of a transcribed le sp Speaker Speech The text of each speaker in a transcribed le s Sentence Sentence from a speaker's speech w Word A word from a sentence ponct Punctuation The punctuation of a given sentence could be one of these three values: . or ! or ?. Example: s id=sID1 ponct=. id id In the Text element, the id attribute identify the le an have as a value the name of the le. For the other elements, id identify each ele- ment by following a specic codication: Element tag+ID+Number Example: w id=wID10 ... s id=sID9 ... Continued on next page
  • 51. 36 Chapter 4. Morpho-syntactic Annotation Method Table 4.2 continued from previous page Tag Name Description na Named entity The attribute na is used to identify words which are the name of person or place. Example: w id=wID1613 na=B_N Value  0 ÖÞ … /Value /w w id=wID1614 na=I_N Value ú q F m F Ì 9d /Value /w elis Elision The attribute elis is used to identify a word which contains elision. In fact, the attribute elis con- tains the word with an elision between parentheses and the Value element contains the correct word. Example: w id=wID30 elis= © ‰ © g( d) © რValue © ‰ © g e © tƒ/Value /w wLang Word Language The default language of words in the transcribed text is the TD. All other languages are considered as foreign. Indeed, we take in consideration three possible values for foreign language: Fr: French language En: English language MSA: Modern Standard Arabic language wTrans Word Transliter- ation The transliteration of a foreign word writing in SAMPA pronunciation. Example: w id=wID1612 wLang=fr wTrans= Valueprofesseur/Value /w Continued on next page
  • 52. 4.3. Preliminary Step 37 Table 4.2 continued from previous page Tag Name Description hesi Hesitation The attribute hesi identies an hesitation word. Example: w id=wID1980 hesi= d / onom Onomato- poeia The onom attribute identies an Onomatopoeia word. Example: w id=wID41 onom= d / The Figure 4.2 present an example of a segmented le. Figure 4.2: Example of the Segmented Text 1 ?xml version='1.0' encoding='UTF−8'? 2 Text id=012 3 sp id=1 4 s id=1 ponct=. 5 w id=1 wLang=Fr wTrans=madame 6 Valuemadame/Value 7 /w 8 w id=2 na=S_N 9 Value éÒ£ e © ¯/Value 10 /w 11 w id=3 12 Valueú © ¯/Value 13 /w 14 w id=4 15 ValueÉ’ © ¯/Value 16 /w 17 w id=5 hesi=È d / 18 w id=6 19 Value © ­t ’Ë d/Value 20 /w 21 !−− the remainder of words −− 22 /s 23 !−− the remainder of sentences −− 24 /sp 25 !−− the remainder of speakers −− 26 /Text
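Since the segmented text is plain XML, it can also be consumed programmatically. The following sketch is our own illustration rather than the thesis code: it assumes the JDOM 2 library (mentioned later in the Realization section) and a hypothetical file name, and it follows the element and attribute names of the example in Figure 4.2.

```java
import java.io.File;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.filter.Filters;
import org.jdom2.input.SAXBuilder;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;

// Minimal sketch: read a segmented text file and print each word with its
// language attribute. The file name is hypothetical; the element/attribute
// names (Text/sp/s/w, wLang, Value) follow Figure 4.2.
public class SegmentedFileReader {
    public static void main(String[] args) throws Exception {
        Document doc = new SAXBuilder().build(new File("segmented_012.xml"));

        // Select every word element, whatever speaker or sentence it belongs to.
        XPathExpression<Element> words =
                XPathFactory.instance().compile("//w", Filters.element());

        for (Element w : words.evaluate(doc)) {
            String lang = w.getAttributeValue("wLang", "TD"); // TD is the default language
            Element value = w.getChild("Value");              // hesitation/onomatopoeia words may have no Value
            String surface = (value != null) ? value.getText() : "";
            System.out.println(w.getAttributeValue("id") + "\t" + lang + "\t" + surface);
        }
    }
}
```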
4.4 Word analysis
First, the characteristics of each word are used to assign it the corresponding tag. Words marked as onomatopoeia, named entities, or foreign words (French or English) are assigned respectively the tags Onomatopoeia, Named entity, and Not-Recognized. Then, we use two analyzers (MADA and the Al-Khalil TD version) and a dictionary, according to the rules defined below, to identify the grammatical class of each remaining word. Two processing paths arise depending on the language of each word: Tunisian dialect words and modern standard Arabic words. We detail these two cases in the following.
a) Tunisian Dialect Words
Figure 4.3: Analyzing process of a TD word
Analyzing a TD word follows the steps described in Figure 4.3. We start by analyzing the word with the TD dictionary, as described in Figure 4.4. If this step does not produce any analysis, we analyze the word with
the Al-Khalil TD version analyzer. If the word is recognized by neither the TD dictionary nor the Al-Khalil analyzer, we remove its diacritics and reanalyze it with the TD dictionary. Analyzing a word without diacritics allows us to handle differences in the way the diacritics of the same word are written. For example, the word © Ẃ) could be written © Ẃ) or © Ẃ) depending on the dialect of the speaker. If this still does not lead to any analysis, we reanalyze the undiacritized word with the Al-Khalil TD version analyzer. Finally, if there is still no possible analysis, we analyze the word with the MADA analyzer. Indeed, many words written without diacritics have the same form in MSA and TD, so analyzing them with an MSA analyzer may succeed in producing an analysis. During this process, if there is more than one analysis, we proceed to rank them. When no analysis at all is possible for a given word, we assign it the tag unknown. Furthermore, our system allows the user to intervene by choosing the correct analysis, by updating an analysis, or by adding a new one. Figure 4.4 gives more details about this procedure; a code sketch of the cascade is also given below.
Analyzing with the TD Dictionary: Analyzing a word with the TD dictionary is handled by applying a morpheme segmentation method similar to the one used in [Yang 2007]. Figure 4.4 shows the steps taken to analyze a word with the TD dictionary. These steps are executed sequentially until an analysis is obtained. First, we search the TD dictionaries in this order: Conjunctions, Pronouns, Number words, Interjections, Particles, Adjectives, Adverbs, Nouns, Verbs. Second, we search sequentially in the Adverb, Noun and Verb dictionaries while trying all possible TD prefixes. Third, we do the same with suffixes. Finally, we repeat the procedure with prefixes and suffixes combined.
Analysis Ranking: We rank and confirm the analyses of each word according to the following order of preference: TD dictionary, Al-Khalil TD version analyzer, MADA analyzer. Furthermore, if these tools and resources give the same analysis, we keep only one of them.
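A minimal sketch of this fallback cascade is given below. TdDictionary, Analyzer and Analysis are hypothetical interfaces introduced only to make the control flow explicit; they are not the real TDAT classes.

```java
import java.util.List;

// Sketch of the TD word analysis cascade of Section 4.4 a).
interface Analysis {}
interface TdDictionary { List<Analysis> lookup(String word); }
interface Analyzer     { List<Analysis> analyze(String word); }

class TdWordAnalyzer {
    private final TdDictionary dict;
    private final Analyzer alKhalilTd;
    private final Analyzer mada;

    TdWordAnalyzer(TdDictionary dict, Analyzer alKhalilTd, Analyzer mada) {
        this.dict = dict; this.alKhalilTd = alKhalilTd; this.mada = mada;
    }

    List<Analysis> analyze(String word) {
        List<Analysis> result = dict.lookup(word);                 // 1. TD dictionary
        if (result.isEmpty()) result = alKhalilTd.analyze(word);   // 2. Al-Khalil TD version
        if (result.isEmpty()) {
            String bare = stripDiacritics(word);                   // 3. retry without diacritics
            result = dict.lookup(bare);
            if (result.isEmpty()) result = alKhalilTd.analyze(bare);
            if (result.isEmpty()) result = mada.analyze(bare);     // 4. last resort: MSA analyzer
        }
        return result;                                             // empty list => tag "unknown"
    }

    // Remove the Arabic short-vowel diacritics (tanween through sukun, U+064B-U+0652).
    static String stripDiacritics(String word) {
        return word.replaceAll("[\\u064B-\\u0652]", "");
    }
}
```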
Figure 4.4: Analyzing a word with the TD dictionary
b) Modern Standard Arabic Words
We use two analyzers, MADA and Al-Khalil, to analyze MSA words. First we use MADA. Then, if there is no possible analysis, we analyze the word with the Al-Khalil analyzer. Finally, if there is still no possible analysis, we remove the diacritics and reanalyze the word with Al-Khalil.
4.5 Choosing results
When using the adapted Al-Khalil analyzer and the dictionary, several analyses may be returned for the same word. This problem is caused by differences in the way diacritics are written, or by ambiguity.
Problem with the Al-Khalil analysis: The adapted Al-Khalil analyzer usually returns a list of analyses for a given word, each with different information (prefix, suffix, gender, number, person, voice, etc.). In general, this problem is related to the ambiguity of the Arabic language.
Problem when using the dictionary: The problem with the dictionary analysis is related to the way diacritics are written. To solve it, we rank the returned results by comparing their distance to the original word. We use the Levenshtein distance [Haldar 2011] to measure the difference between two sequences or words. Mathematically, the Levenshtein distance between two strings a and b is given by lev_{a,b}(|a|, |b|), where:
lev_{a,b}(i, j) =
  0, if i = j = 0
  i, if j = 0 and i > 0
  j, if i = 0 and j > 0
  min( lev_{a,b}(i−1, j) + 1, lev_{a,b}(i, j−1) + 1, lev_{a,b}(i−1, j−1) + [a_i ≠ b_j] ), otherwise
• lev_{a,b}(i−1, j) + 1: the minimum corresponds to a deletion (from a to b).
• lev_{a,b}(i, j−1) + 1: the minimum corresponds to an insertion (from a to b).
• lev_{a,b}(i−1, j−1) + [a_i ≠ b_j]: the minimum corresponds to a match or a mismatch, depending on whether the respective symbols are the same.
Example: Original word: e © t( ‚Ó — word returned by the analysis: e © t( ‚ Ó ⇒ the distance between the two words is 2, as calculated in Table 4.3.
When we obtain analyses from both the Al-Khalil TD version analyzer and the TD dictionary (the case where diacritics are written differently), we use the dictionary analysis to confirm the Al-Khalil analysis if they have the same grammatical function. Finally, we generate the annotation file as described in the following section.
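The definition above translates directly into the usual dynamic-programming implementation. The sketch below is our own illustration rather than the TDAT source; the second method shows how dictionary candidates could be ranked by their distance to the original word, as described above.

```java
import java.util.Comparator;
import java.util.List;

public class Levenshtein {

    // Standard dynamic-programming Levenshtein distance, following the
    // piecewise definition given above.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // deletions only
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // insertions only
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1; // [a_i != b_j]
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,     // deletion
                                            d[i][j - 1] + 1),    // insertion
                                   d[i - 1][j - 1] + cost);      // match or substitution
            }
        }
        return d[a.length()][b.length()];
    }

    // Rank candidate surface forms so that the one closest to the original
    // word comes first (illustrative helper, not the TDAT code).
    public static void rankByDistance(String original, List<String> candidates) {
        candidates.sort(Comparator.comparingInt((String c) -> distance(original, c)));
    }
}
```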
4.6 Result file generation
The tags used when generating the annotation file are presented in Table 4.4.
Table 4.4: Description of the tags used in the annotation result file
Tag | Name | Description
asp | aspect | The aspect (orders or requests, perfective, imperfective)
vox | voice | The voice (active, passive, etc.)
stt | state | The state (indefinite, definite, construct, etc.)
per | person | The person (1st, 2nd, 3rd)
num | number | The number (plural, dual, singular)
gen | gender | The gender (feminine, masculine)
case | case | The case (nominative, accusative, genitive)
suffix | suffix | The suffix of the word
pattern | pattern | The pattern of the word
root | root | The root of the word
stem | stem | The stem of the word
spee | part of speech | The part of speech of the word
prefix | prefix | The prefix of the word
4.7 Conclusion
Several problems complicated the integration of the tools we used. They were mainly due to the different input/output formats of the tools and the different granularities of their tag sets. In addition, other problems appeared while using the analysis tools. For example, the MADA analyzer ignores word characteristics such as the language of the word, which can affect the analysis result. Another problem occurs when the Al-Khalil analyzer returns several analyses, so that the user has to intervene to choose one of them.
Table 4.1: Annotations extracted from the transcription files
Pronunciation notation: Elongation, Liaisons, Elisions, Incomprehensible sequence
Disfluencies: Incomplete word, Onomatopoeia
Named entities: Person name, Place name
Word language: TD, MSA, French, English
  • 59. 44 Chapter 4. Morpho-syntactic Annotation Method Table 4.3: Levenshtein distance table example 1- 0 1 2 3 4 5 6 7 Ð € d ø è © à d d -1 0 1 2 3 4 5 6 7 8 0 Ð 1 0 1 2 3 4 5 6 7 1 € 2 1 0 1 2 3 4 5 6 2 d 3 2 1 1 2 3 4 5 6 3 ø 4 3 2 2 1 2 3 4 5 4 © à 5 4 3 3 2 2 2 3 4 5 d 6 5 4 3 3 3 3 2 3 6 d 7 6 5 4 4 4 4 3 2
Chapter 5
Realization and Performance Evaluation
5.1 Introduction
The expansion of NLP applications for dialectal Arabic requires a large amount of resources in terms of data and tools. By developing a morpho-syntactic annotation tool for the TD language, we facilitate the morpho-syntactic annotation task and thereby support the construction of corpora. In this chapter, we first introduce the tools and resources used in our tool. Then, we present our TD annotation tool by explaining its different modules and functionalities and by providing some details about the development environment. Finally, we experiment with our tool and discuss the results obtained in the different assessments.
5.2 Tools and Resources
The morpho-syntactic annotation process contains several tasks that can be handled using existing tools and resources. In this section, we introduce the analyzers and the dictionary that we used.
5.2.1 Al-Khalil analyzer
The Al-Khalil analyzer was developed to produce tags for a given text by performing a morphological analysis of that text. Its lexical resource consists of several classes that handle vowelled and unvocalized words. The main process is based on patterns for both verbal and nominal words, Arabic word roots, and affixes. According to [Altabba 2010], Al-Khalil is still the best morphological analyzer for Arabic. In addition, Al-Khalil won the first prize in a competition organized by the Arab League Educational, Cultural and Scientific Organization (ALECSO) in 2010.
  • 61. 46 Chapter 5. Realization and Performance Evaluation 5.2.2 MADA analyzer The MADA+TOKAN toolkit is a morphological analyzer that is introduced by [Habash 2009] and used to derive extensive morphological and contextual information from Arabic text. Indeed, the toolkit includes many tasks; high-accuracy part-of-speech tagging, diacritization, lemmatization, disambiguation, stemming, and glossing. In ad- dition, MADA classies the analysis results and gives as an output the more suitable analysis to the current context of each word. The analysis results carry a complete diacritic, lexemic, glossary and morphological information. Also, TOKAN takes the information provided by MADA to generate to- kenized output in a wide variety of customizable formats which allow an easier extraction and manipulation. In addition, MADA achieved an accuracy score of 86% in predicting full diacritization and 96% on basic morphological choice and on lemmatization. 5.2.3 Al-Khalil TD version Recent research in our ANLP group [Amouri 2013] conducted the study of the dialect language by adapting the Al-Khalil analyzer to the TD language. Thanks to the enrich- ment of the transformation rules, the adapted analyzer achieves a score of 81.17% and 96.64% in terms of recall and accuracy for verbs correctly analyzed. 5.2.4 Tunisian Dialect Dictionary The TD dictionary [Ayed 2013, Boujelbane 2013] were constructed using lexical units in the Arabic Tree bank corpus and their parts of speech, to convert words from the MSA to TD language. The obtained results consist on an XML lexical database composed of nine dictionaries (Conjunctions, Pronouns, Number words, Interjections, Particles, Adjectives, Adverbs, Nouns, Verbs). 5.3 Tunisian Dialect Annotation Tool This section is dedicated to present our TD annotation Tool. Indeed, the rst part will clarify the usefulness of our system and its functionality. The second part is divided to clarify details, characteristics, and the development environment.
  • 62. 5.3. Tunisian Dialect Annotation Tool 47 5.3.1 Process of TDAT In order to specify and visualize the artifacts of our system, we will detail its func- tionalities and its manipulation procedure. Also, we will introduce the structure of our system. a- System functionalities The principal functionality of our system is to generate a morpho-syntactitc annotation for each word in the transcription le. Indeed, this process is composed of two basic steps; The rst is to segment the transcription le. The generated segmented le follows a unique XML structure which allows a better representation of the text and speech phenomena. The second step requires as an input a segmented text le, then analyses each word by determining its suitable grammatical class. Our annotation system allows an easier manipulation of the obtained analysis result. Indeed, the user could show, update, save an annotation le, and open an unaccomplished annotated le to complete it. Additionally, the user has the possibility to select through the available option of dictionary and morphological analyzers which will be used in the morpho-syntactitc annotation process. • Segment a Transcription File: The aim of the segmenting script is to prepare the transcription le to the entry of our TDAT. Indeed, this tool will allow our TDAT to support dierent transcription les types. Currently, the developed tool supports two transcripts les formats (trs and TextGrid). The generated le (Segmented Text) follows a unique structure (XML) that allows a better representation of speech phenomena for the user. Indeed, the used structure was created in accordance with the TEI recommendations. Furthermore, the segmented text allows an easier interpretation of the orthographic transcription by our system. Also, the Segmenting tool generates three other les: - Words' list: contains a list of all words and their frequency. - Sentences' list: contains a list of all sentences. - Statistics: contains some useful statistics about the content of the transcription le (See Figure 5.1 for more details).
  • 63. 48 Chapter 5. Realization and Performance Evaluation 1 ?xml version='1.0' encoding='UTF−8'? 2 STATISTIC id=110 3 Speakers3/Speakers 4 Sentence267/Sentence 5 Words3231/Words 6 Hezitation117/Hezitation 7 Onomatopoeia150/Onomatopoeia 8 Elisions102/Elisions 9 NamedEntity54/NamedEntity 10 /STATISTIC Figure 5.1: XML Shema of the Annotated Text In order to segment a transcription le (trs or TextGrid), the user has to open or select a transcription le in the corpus les tree (see (2) in Figure 5.3 for more details). Then, our tool starts the process by opening the transcription le. Indeed, an error message appears when there is a problem while executing this process such as unexpected le format. Along this process, the system informs the user about the progress (See (4) in Figure 5.3). Then, our tool leads the generated segmented text le and asks for showing the obtained results. Finally, our system loads and shows in the statistic menu the statistics information (number of words, sentence, speaker, hesitation, onomatopoeia, elision and named entity) relative to the segmented le (see (2) in Figure 5.4 for more details). The user is noticed if there is a problem while loading the segmented le or the statistic le. • List of Words Frequency: To show the list of words frequency, an already segmented le must be opened or selected in the segmented le list. Indeed, our system leads the list of words frequency le relative to the selected segmented le. If there is a problem while loading the list of words frequency, an error message will appear to inform the user. • Annotation options: To analyse a word, our system uses analyzers and a dictionnary. Those choices could be updated before annotating a le by choosing from the available analyzers and dictionary which one to use. In addition, the path of the used resources could be easily updated (See Figure 5.8). These options take place after saving when a new annotation process is launched.
  • 64. 5.3. Tunisian Dialect Annotation Tool 49 • Segmented File Annotation: The annotation process starts when the user wants to annotate a transcribed le. Indeed, the user has the choice to segment an opened transcription le in the segmented le list or to open an incomplete annotated le. Then, switch the selected resources option (Dictionary and analyzers), the developed system launches the analysis process. During this process, the analysis results appear progressively in the annotation window (see (1) in Figure 5.6 for more details) and another window appears to inform the user about the progress (see (2) in Figure 5.6). In addition, our system gives the user the possibility to show a recap of the analyzed words along the annotation process (see Figure 5.7 for more details). If a problem occurs while analyzing a word or executing one of the used morphological analyzer tool an error message will appear in the console. Also, the user could stop all the current process at any time (see (2) in Figure 5.6). When there is more than one analysis for a given word, case of ambiguity, the user has to intervene to select and to conrm the right analysis for the word (see Figure 5.5). After conrming all the analysis, the user could save the annotation le, otherwise, the system saves the incomplete le by conserving all the analysis for each word that has not been conrmed. To save an annotation le, the user has to select le format and the result directory path. Finally, a message will appear to inform the user that the annotation le has been successfully saved and a new window containing the obtained result appears (See Figure 5.9). Although, a message will appear to inform the user that there is a problem while saving the annotation le. • Update Analysis Results: The user has the possibility to update the analysis results while or after analyzing. To update the analysis results, the user has to select the appropriate grammatical class of a given word from the analysis list returned by our system. In addition, the user could add a new analysis by typing the additional information such as prex and sux (See Figure 5.5). After conrming the new analysis, the system updates the annotation window by adding the new tag in the top of the relative word analysis list. Indeed, the new
added analysis is considered the best analysis, so there is no need to confirm it later.
b- System Collaborations:
Our system collaborates with other tools to determine the most suitable grammatical class for each word. It interacts with:
• the MADA analyzer: to analyze a text;
• the Al-Khalil analyzer: to analyze a word;
• the Al-Khalil TD analyzer: to analyze a word;
• a Perl script: to segment a transcription file.
5.3.2 Realization
We chose the JAVA programming language to develop our system for several reasons:
• JAVA was one of the most popular programming languages in use¹ in 2012, thanks to its simplicity.
• It is platform-independent at both the source and binary levels.
• It allows creating modular programs, which makes it easy to reuse the predefined structure of other projects, in particular the Al-Khalil source code.
Furthermore, we used multi-threading to perform several tasks simultaneously, especially during the annotation process. By using this technique, we first increased the processing speed. Second, we allowed a progressive display of the results, which lets the user intervene to confirm the returned analyses instead of waiting for the end of the whole annotation process.
The choice of the Eclipse development environment allows us to program with several programming languages, in particular PERL and Java. Besides its extensibility in terms of programming languages, this multi-platform environment was already used for the development of the Al-Khalil analyzer. We therefore kept the same project characteristics, such as the text file encoding (Cp1256).
¹ http://en.wikipedia.org/wiki/Java_(programming_language)
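As an illustration of the multi-threaded design described above (a sketch of ours, not the actual TDAT threading code), the annotation loop can run on a background thread while each finished analysis is pushed to the interface through a callback:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Sketch only: run the annotation loop off the UI thread and report each
// analyzed word as soon as it is ready, so the user can confirm analyses
// without waiting for the whole file.
public class BackgroundAnnotator {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private volatile boolean stopped = false;

    public void annotate(List<String> words, Consumer<String> onWordAnalyzed) {
        pool.submit(() -> {
            for (String w : words) {
                if (stopped) break;                 // "stop current process" button
                String analysis = analyze(w);       // dictionary / Al-Khalil / MADA
                onWordAnalyzed.accept(w + " -> " + analysis); // progressive display
            }
        });
    }

    public void stop() { stopped = true; }

    private String analyze(String word) { return "unknown"; } // placeholder
}
```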
In addition, we chose to work in the LINUX environment to take advantage of its speed. This also allows better performance of the MADA analyzer, which is primarily designed for this environment; indeed, the MADA analyzer is entirely built in the PERL programming language.
To manipulate the segmented file and the annotation file, we used the XPath expression language. XPath is based on a tree representation of the XML document and provides the ability to navigate around this tree by selecting nodes with a variety of criteria. XPath was defined by the World Wide Web Consortium (W3C), and its use in our environment requires the JDOM package. Thus, we imported the JDOM library (version 2.0.0) into our project.
Thanks to the use of the Perl programming language, even processing large files takes less than one second. PERL provides many predefined methods that ease the manipulation and generation of text files, and it is considered a leading language in the field of text file processing. Updating the Perl code is also quite easy and avoids updating the whole application. Furthermore, PERL is a multi-platform language and is usually pre-installed on LINUX systems. As we already use the MADA analyzer, relying on this language also benefits our TDAT and does not require any extra dependency. We also added the EPIC plug-in to the Eclipse development environment in order to edit the PERL scripts.
5.3.3 TDAT Interfaces
Main Interface
The main interface is divided into four parts:
Main menu: The purpose of the main menu ((1) in Figure 5.2) is to provide easy access to all the files used by our application. The menu is organized according to the format of the files:
• File: File management.
• Transcription: Management of the transcription files.
• Segmented Text: Management of the segmented text files.
  • 67. 52 Chapter 5. Realization and Performance Evaluation Figure 5.2: Main Interface in the TDAT • Annotation: Management of the annotation les. Speed access menu: The main purpose of the speed access menu (2) in Figure 5.2 is to easily access and use the basic functions of our application that is why we classied these functions in four themes: 1. Corpus 2. Transcription 3. Annotation 4. Statistics Transcription interface The content of a transcription le could be visualized (see (1) in Figure 5.3) through the use of the corpus tree (see (2) in Figure 5.3). Indeed, the corpus tree contains all the transcription les in the corpus.
  • 68. 5.3. Tunisian Dialect Annotation Tool 53 Figure 5.3: Transcription window in the TDAT Segmented text interface To start segmenting (see (3) in Figure 5.3) a transcription le, the user has to choose a le from the corpus tree. The segmentation process is presented as in Figure 5.3. When the process is accomplished a new window that contains the segmented text appear (see (1) in Figure 5.4). In addition, the statistics menu (see (2) in 5.4) shows the content of its relative statistics le. Add analysis interface The user could update analysis results by selecting the right grammatical class for each word (see Figure 5.6). Furthermore, the user could add a new analysis for a given word by choosing the grammatical class and writing its prex and sux in the Add analaysis interface (see Figure 5.5). Analysis interface The analysis in the Figure 5.6 appears progressively. As well, the status icon changes ac- cording to the analyze advancement. Also, another window appears to show the progress of the used options (see (2) in 5.6).
  • 69. 54 Chapter 5. Realization and Performance Evaluation Figure 5.4: Segmented Text window in the TDAT Figure 5.5: Add an analysis result window in the TDAT The Table 5.1 describes each icon signication: Table 5.1: Description of icons used in the annotation interface
  • 70. 5.3. Tunisian Dialect Annotation Tool 55 Figure 5.6: Analyze window in the TDAT Icon Description No possible analysis One analysis Many analysis Add an analysis Conrm an analysis Analysis given by MADA analyzer Analysis given by Al-khalil analyzer Analysis given by TD Dictionary An analysis proposed by the user An Analysis extracted from the transcription
  • 71. 56 Chapter 5. Realization and Performance Evaluation Analysis details interface Figure 5.7: Analyse Details window in the TDAT The statistic button in the annotation interface (see (3) in Figure 5.6) gives the user details about the recognized words (see Figure 5.7). Indeed, these statistics are updated automatically while the analysis process. Annotation options interface Analysis result le interface To save the analysis, the user has to choose the le format and to enter the result le name and path. The generated annotation le is showed in gure 5.9.
Figure 5.8: Annotation options window in the TDAT
5.4 Evaluation
Evaluating a morpho-syntactic annotation system allows us to determine its capabilities and to diagnose its strengths and weaknesses. The evaluation process therefore requires a lot of objectivity. The best-known evaluation method is to compare the performance of the developed system with other similar systems; such systems must have the same input and output. The capability of analyzing a word with its diacritics can also be a decisive factor in the evaluation. Another method used in the state of the art to evaluate a morphological analyzer consists in comparing the analysis results with a gold standard. However, no gold standard morpho-syntactic annotation is available for the TD language yet. Thus, we developed a gold standard to evaluate our tool, as described below. Then, we evaluate three basic modules of the developed system. Finally, we summarize the strengths and weaknesses of the developed TDAT.
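For reference, the three scores used in the following subsections (Accuracy, Precision, F-score1) reduce to simple ratios of the counts defined there. The small helper below is written by us for clarity and is not part of the TDAT source.

```java
// Reference implementation of the evaluation scores used in Section 5.4.2.
public class EvalMetrics {

    // Accuracy = (TP + TN) / (TP + TN + F): share of analyzed words that were recognized.
    static double accuracy(int tp, int tn, int f) {
        return (double) (tp + tn) / (tp + tn + f);
    }

    // Precision = TP / (TP + TN): share of recognized words that were analyzed correctly.
    static double precision(int tp, int tn) {
        return (double) tp / (tp + tn);
    }

    // F-score1 = 2*TP / (2*TP + F), as defined at the end of Section 5.4.2.
    static double fScore1(int tp, int f) {
        return 2.0 * tp / (2.0 * tp + f);
    }

    public static void main(String[] args) {
        // Example with the dictionary-module figures of Table 5.3:
        // 77 correctly analyzed vs. 4 incorrectly analyzed -> 95.06% precision.
        System.out.printf("Precision: %.2f%%%n", 100 * precision(77, 4));
    }
}
```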
Figure 5.9: Analysis Result File window in the TDAT
5.4.1 Gold standard for the TD language
In order to evaluate our system, we developed a gold standard for the TD language composed of two annotated transcription files (2,409 words). These morpho-syntactic annotations were created manually by an expert in linguistics. The annotation tags used are the same as in the TD dictionary (Conjunction, Pronoun, Number word, Interjection, Particle, Adjective, Adverb, Noun, Verb). We also included the suffix and the prefix as additional information.
5.4.2 Evaluate the TDAT
We chose to evaluate three modules of our system:
Evaluate the word segmenter: One of the basic tasks when analyzing a word with the dictionary is to determine its prefix and suffix. Indeed, words may include a suffix and a prefix, so the system has to decompose them using our word segmenter module. To evaluate this module, we first analyze the words without segmenting them; then we use our segmenter to identify the possible prefix and suffix of each word not recognized by our system. Example:
The following word e êË e © t( ‚ Ó could not be recognized when analyzed with the dictionary. However, by using our word segmenter module within the dictionary analyzer, our system identifies its suffix (e êË) and annotates the word as a verb. The evaluation results in terms of Accuracy are detailed in Table 5.2. Accuracy is defined as:
Accuracy = (TP + TN) / (TP + TN + F) = T / AW
• TP: words recognized and correctly analyzed
• TN: words recognized but not correctly analyzed
• F: words not recognized
• T: all recognized words
• AW: all analyzed words
Table 5.2: Evaluation results of the word segmenter module
 | Recognized | Not recognized | Accuracy
Before using the segmenter module | 239 | 826 | 22.44%
After using the segmenter module | 410 | 655 | 38.49%
By integrating the segmenter module, we thus achieved an improvement of 16.05% in terms of Accuracy when using the dictionary module (after segmentation, 410 of the 1,065 analyzed words are recognized).
Evaluate the Levenshtein distance: Analyzing a word with the TD dictionary can produce several analyses, which is primarily due to differences in diacritics writing. After testing our system on a transcription text composed of 1,065 words, we found that 47% (81 cases) of the analyses obtained from the dictionary module are ambiguous. Indeed, multiple choices appear for these words, owing to the use of the segmenter module. To rank these analyses, we used the Levenshtein distance.
Example:
Word: e ëñt ¢ ª © u
Possible analyses without applying the Levenshtein distance: ù ¢ ª © u, ñt ¢ ª © u
Possible analyses after applying the Levenshtein distance: ñt ¢ ª © u, ù ¢ ª © u
In this example, the system places in first position the word ñt ¢ ª © u, which has the shortest distance to the original word e ëñt ¢ ª © u.
In order to evaluate the usefulness of the Levenshtein distance function, we evaluate the Precision score of the TD dictionary module. Precision is defined as:
Precision = TruePositive / TestOutcomePositive = CorrectlyAnalyzedWords / AllRecognizedWords
Since our system considers the first analysis in the list as the best analysis for a word, ordering the results with the Levenshtein distance improves the system's performance. Table 5.3 gives the evaluation results of the TD dictionary module before and after applying the Levenshtein distance procedure.
Table 5.3: Evaluation results of the TD dictionary module using the Levenshtein distance function
 | Correctly analyzed | Not correctly analyzed | Precision
Before using the Levenshtein distance | 52 | 29 | 64.19%
After using the Levenshtein distance | 77 | 4 | 95.06%
We notice that in some cases ordering the analyses according to suffix and diacritics distance does not solve the ambiguity problem. The integration of a classification module based on the sentence context of the word could solve it.
Evaluate the analysis results: The main function of our tool is to deliver as output a morpho-syntactic annotation for each word of the input. In order to evaluate this module, we used our gold standard as a test corpus.
The evaluation results of the TDAT when analyzing with all resource options are detailed in Table 5.4.
Table 5.4: Evaluation results of the TDAT
 | Recognized | Not recognized
Correct analysis | 1803 | 251
Incorrect analysis | 355 | —
We obtain a Precision score of 83.54%. The incorrect analyses are almost all Adjectives interpreted as Verbs. This problem was detected when analyzing complex TD words with the Al-Khalil analyzer: the patterns used give an incorrect interpretation when the word suffixes are removed. When using the MADA analyzer, some words are incorrectly tagged as Noun, which is caused by the difference in meaning of these words between the TD and MSA languages. Analysis errors when using the TD dictionary module are due to differences in diacritics; this problem is caused by variation in the way TD is spoken. We used the F-measure to study the quality of the analysis results. The F-measure is defined as:
F-score1 = 2*TP / (2*TP + F)
• TP: correctly analyzed words
• F: not recognized words
We obtained an F-score1 of 91.03%, which is a promising result compared to the existing tools for the TD language.
5.5 Conclusion
In this chapter we presented the TDAT, which was developed to handle the morpho-syntactic annotation task. In order to allow easier use and extension of our project,
we used free, multi-platform software. In addition, the TDAT can use different resource options, which leads to a detailed analysis. Despite requiring more analysis time than the other tools we used, the MADA analyzer remains very useful for transcripts that contain formal discussions, such as TV dialogues. The results obtained when using all resource options are very promising: we achieved an F-score1 of 91.03% on our test corpus. The developed tool could be further improved by ranking the analysis results. Also, enriching the TD dictionary, especially the Noun dictionary, could lead to better results.
Conclusion
The tools for dialectal Arabic are few and often lack certain features or do not reach the same standard as their MSA counterparts. There is therefore a need for resources and tools for the Arabic dialects in order to start creating new and better NLP applications. By developing the TDAT, we aimed to provide a tool that accepts different transcription formats and produces morpho-syntactic annotations for TD words.
In order to build a morpho-syntactically annotated corpus for the TD language, we started by collecting speech data. Then, we transcribed the collected data following our orthographic transcription guidelines, using two transcription tools (Transcriber and Praat). Finally, we developed a tool that accepts the elaborated transcription files as input and produces a morpho-syntactically annotated file as output. To handle this task, our tool uses a TD dictionary and two analyzers (the Al-Khalil TD analyzer and the MADA analyzer). In addition, the resource options used can be easily updated.
During the transcription process we created a corpus of more than 1 hour and 25 minutes. A portion of the developed corpus was used to train the developed system. In order to determine the capability of our TDAT tool to analyze TD text, we constructed a test gold standard for the TD language and used a portion of this corpus to test the different modules of our tool. The evaluation results show that the segmenter module achieves a score of 95.06% in terms of precision. However, regarding the obtained Accuracy score, analysis with the dictionary still needs improvement, in particular by enriching the dictionary, especially the Noun dictionary. Thanks to the use of Al-Khalil TD and the other resource options, our tool attains an F-score1 of 91.03%.
The developed corpus could be enlarged by integrating other topics. Furthermore, our corpus covers different subjects and can be used for learning linguistic analysis models, for automatic speech processing, or in any other area of natural language processing. The analysis results obtained by our tool could be improved by enlarging the TD lexical database, by using a classification module (based on statistics, for example) to rank the analysis results, and by updating the Al-Khalil patterns and database through the study of the new ambiguous cases encountered during analysis. The input of our system could be
modified to support other speech-text formats, such as web pages, since the use of dialectal language on social networks is constantly increasing.