University Of Sfax
Faculty of Economics and Management of Sfax
Multimedia, InfoRmation Systems and Advanced Computing Laboratory
M A S T E R T H E S I S
to obtain the title of
Master Degree in Computer Systems and new Technology
Defended by
Imad Eddin Jerbi
Construction and Morpho-syntactic
Annotation of a Colloquial Corpus:
Case of Tunisian Arabic
Supervisor: Mariem Ellouze
Co-supervisor: Inès Zribi - Rahma Boujelbane
Defended on 27th December 2013
Jury:
President : Lamia Hadrich Belguith Professor (FSEGS)
Reviewer : Maher Jaoua Associate Professor (FSEGS)
Advisor : Mariem Ellouze Associate Professor (ESCS)
Invited : Inès Zribi University of Provence
Rahma Boujelbane University of Sfax
Acknowledgments
I would like first to express my thanks to the head of ANLP-RG
Mrs. Lamia Hadrich Belguith for accepting me in the research group.
Above all, I would like to express my deepest appreciation to my supervisor
Mrs. Mariem Ellouze Khemakhem and the co-supervisor Miss. Inès Zribi and
Miss. Rahma Boujelbane - you have been a constant source of encouragement and
guidance, and your faith in me is largely responsible
for not only completing this thesis but also enjoying working on it.
I would like to thank the jury members:
Mr. Maher Jaoua, Mrs. Lamia Hadrich Belguith,
Mrs. Mariem Ellouze, Miss. Inès Zribi and Miss. Rahma Boujelbane
for their precious time reading my thesis and for their constructive comments.
I must not forget to thank my professors who generously shared their expertise.
Also, I especially thank the master department director Mr. Mahmoud Naji.
I also would like to thank my family and all my friends especially Hakim Mkacher and
Hamdi Zroud for their support and help.
Thank you ALL!
Contents
Acknowledgments i
List of figures v
List of tables vii
List of Abbreviations xi
Introduction 1
I Related Work 3
1 Linguistic Resources 5
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Speech and Text Data Collection . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Arabic Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Modern Standard Arabic Corpora . . . . . . . . . . . . . . . . . . 6
1.3.2 Dialectal Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Orthographic Transcription . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Transcription Software . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.2 Transcription Guidelines . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Morpho-Syntactic Annotation 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Morpho-syntactic Annotation Methods for MSA Language . . . . . . . . 13
2.3 Morpho-syntactic Annotation Methods for Dialectal Arabic . . . . . . . . 16
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
II Proposed Method 19
3 Data Collection and Transcription 21
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Speech Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Transcription Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Transcription Tools . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.2 Transcribing Guidelines . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4 Morpho-syntactic Annotation Method 33
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Our main Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Preliminary Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Word analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5 Choosing results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.6 Result file generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5 Realization and Performance Evaluation 45
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Tools and Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.1 Al-Khalil analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.2 MADA analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.3 Al-Khalil TD version . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.4 Tunisian Dialect Dictionary . . . . . . . . . . . . . . . . . . . . . 46
5.3 Tunisian Dialect Annotation Tool . . . . . . . . . . . . . . . . . . . . . . 46
5.3.1 Process of TDAT . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.2 Realization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.3 TDAT Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4.1 Gold standard for the TD language . . . . . . . . . . . . . . . . . 58
5.4.2 Evaluate the TDAT . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Conclusion 63
A TD Enriched Orthographic Transcription 65
A.1 Inter-pausal units segmentation . . . . . . . . . . . . . . . . . . . . . . . 65
A.2 Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.2.1 Typographic rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.2.2 Pronunciation notation . . . . . . . . . . . . . . . . . . . . . . . . 67
A.2.3 Liaisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.2.4 Non Arabic phonemes . . . . . . . . . . . . . . . . . . . . . . . . 68
A.2.5 Reported speech . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.2.6 Incomprehensible sequences . . . . . . . . . . . . . . . . . . . . . 68
A.2.7 Laughers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.2.8 Pauses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
B LCD Commitment 71
Bibliography 73
List of Figures
3.1 The proportion of themes in the corpus . . . . . . . . . . . . . . . . . . . 22
3.2 Sections and Turns in Transcriber . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Manage speakers in Transcriber . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Manage speakers in Praat . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Morpho-syntactic annotation steps in our method . . . . . . . . . . . . . 34
4.2 Example of the Segmented Text . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Analyzing process of a TD word . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 Analyzing a word with TD dictionary . . . . . . . . . . . . . . . . . . . . 40
5.1 XML Schema of the Annotated Text . . . . . . . . . . . . . . . . . . 48
5.2 Main Interface in the TDAT . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Transcription window in the TDAT . . . . . . . . . . . . . . . . . . . . . 53
5.4 Segmented Text window in the TDAT . . . . . . . . . . . . . . . . . . . 54
5.5 Add an analysis result window in the TDAT . . . . . . . . . . . . . . . . 54
5.6 Analyze window in the TDAT . . . . . . . . . . . . . . . . . . . . . . . . 55
5.7 Analyse Details window in the TDAT . . . . . . . . . . . . . . . . . . . 56
5.8 Annotation options window in the TDAT . . . . . . . . . . . . . . . . . 57
5.9 Analysis Result File window in the TDAT . . . . . . . . . . . . . . . . . 58
List of Tables
3.1 Corpus files content . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Description of tags used in the references XML file . . . . . . . . . . . . 24
3.3 Clitics in the TD language . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Description of tags used in the segmented text file . . . . . . . . . . . . . 35
4.4 Description of tags used in the annotation result file . . . . . . . . . . . . 42
4.1 Annotations extracted from the transcription files . . . . . . . . . . . . . 43
4.3 Levenshtein distance table example . . . . . . . . . . . . . . . . . . . . . 44
5.1 Description of icons used in the annotation interface . . . . . . . . . . . . 54
5.2 Evaluation results of the word segmenter module . . . . . . . . . . . . . 59
5.3 Evaluation results of the TD dictionary module using the Levenshtein
distance function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4 Evaluation results of the TDAT . . . . . . . . . . . . . . . . . . . . . . . 61
List of Abbreviations
AMADAT Arabic Multi-Dialectal Transcription Tool
AOC Arabic Online Commentary Dataset
CES Corpus Encoding Standard
DTD Document Type Definitions
ECA CALLHOME Egyptian Arabic Speech
ELAN EUDICO Linguistic Annotator
FBIS Foreign Broadcast Information Service
HMM Hidden Markov Model
HPL Hewlett-Packard Laboratories
ICA Iraqi Colloquial Arabic
LA Levantine Arabic
LATB Levantine Arabic TreeBank
LDC Linguistic Data Consortium
MSA Modern Standard Arabic
NLP Natural Language Processing
OSAC Open Source Arabic Corpora (Updated)
OSAc Open Source Arabic Corpus
POS Part of Speech
POST Part of Speech Tagging
SAAVB Saudi Accented Arabic Voice Bank
SAMPA Speech Assessment Methods Phonetic Alphabet
STT Speech-to-Text
TD Tunisian dialects
TDAT Tunisian Dialect Annotation Tool
TEI Text Encoding Initiative
XML Extensible Markup Language
XSL eXtensible Stylesheet Language
Introduction
The Arabic language is spoken by about 300 million people [Al-Shamsi 2006] and is
the fourth most spoken language; it is thus a major international modern language.
Considering the number of people who speak it, computing resources for the Arabic
language are still few. The Arabic language is a blend of Modern Standard Arabic, used
in written and formal spoken discourse, and a collection of related Arabic dialects. This
mixture was defined by Hymes [Hymes 1973] as a linguistic continuum. Indeed, Arabic
dialects present significant phonological, morphological, lexical, and syntactic differences
among themselves and when compared to the standard written forms.
Furthermore, the presence of diglossia [Ferguson 1959] is a real challenge for Arabic
speech language technologies, including corpus creation to support Speech-to-Text (STT)
systems. Other difficulties for researchers stem from the morphological complexity of the
Arabic dialects. In addition, only small amounts of text data exist for spoken Arabic,
because these varieties have no official written form.
Better corpora will, in the first place, support further research in the area, for example
helping linguists analyze Arabic dialect phenomena. They also lay the ground for creating
new and better end-user applications.
One of the fundamental prerequisites of any Natural Language Processing (NLP)
application in a specific language, such as the Tunisian Dialect (TD), is the existence of
corpora. Indeed, the construction of speech corpora for the TD is fundamental for study-
ing its specific characteristics and for advancing its NLP applications, for example speech
recognition. A few small corpora developed in previous research exist today. However,
these corpora mix several dialects, are not specific to the TD, and do not include any
diacritic information. In general, they belong to closed projects or are not freely available.
In addition, they include neither morpho-syntactic annotation nor phonetic information.
The aim of this project is to investigate how to collect and transcribe speech data,
the possibility of using existing transcription tools, the choice of appropriate guidelines,
how to annotate the transcripts, and which methods to use.
The report is divided into two parts. The first part presents the state of the art of
existing speech corpus resources for the Arabic language; its second chapter lists some
morpho-syntactic annotation methods. The second part describes the method and the
resources used to collect, transcribe, and annotate speech data. The third chapter presents
the steps that we followed to collect and transcribe speech data. In the fourth chapter,
we present our method for achieving the morpho-syntactic annotation task. The last
chapter presents the tools and resources used, the developed tool, and the obtained
results. Finally, a conclusion summarizing the results of our work and presenting some
future prospects is given at the end of this report.
Part I
Related Work
Chapter 1
Linguistic Resources
1.1 Introduction
A spoken language corpus is defined as a collection of speech recordings which is ac-
cessible in computer readable form and which comes with annotation and documentation
sufficient to allow re-use of the data in-house, or by scientists in other organizations
[Gibbon 1997]. Indeed, creating speech corpora is crucial for studying the different
characteristics of spoken language, as well as for developing applications that deal with
voice, for example speech recognition systems.
In this chapter, we answer the following question: how is a speech corpus created?
We first present methods of speech and text data collection in the next section. Then,
we focus on some available corpora for the Arabic language. After that, we introduce the
orthographic transcription task with a literature recap of its guidelines and tools.
1.2 Speech and Text Data Collection
A prerequisite for the successful development of spoken language resources is a good
definition of the speech data to be collected. There are three steps in text data collection.
The first is to specify the source of the data (books, novels, chat rooms, etc.), its type
(standard written language or dialectal), theme (social, news, sport, etc.), encoding and
format (transcribed files, web pages, XML files, etc.). The second step, data collection,
is performed using different techniques such as harvesting large amounts of data from
the web [Diab 2010] and transcribing speech data [Messaoudi 2004]; automatic speech
recognition, as used in [Messaoudi 2004, Gauvain 2000], can also be applied to extract
text from speech data. The third step consists in adapting and organizing these data
[Diab 2010].
Speech data collection follows the same steps as text collection. In addition, speech
data, with their different types (audio or video) and formats (mp3, wave, avi, etc.), are
collected in different ways. The easiest is to download streaming video and audio from
the Internet. Unfortunately, with this method we cannot guarantee good data quality.
Otherwise, we have to resort to recording, where we can fix the subjects and the speakers'
dialects as we wish and, at the same time, ensure better data quality. However, recording
requires funding to pay speakers and to buy specific equipment.
Some annotation tools [Kipp 2011] give direct access to broadcast data through the
associated Uniform Resource Locator (URL), which removes the collection step. Regret-
tably, this feature is currently not available in the voice annotation tools we used.
1.3 Arabic Corpora
The Arabic language is composed of a standard written language (Modern Standard
Arabic) and a collection of spoken dialects. The Arabic dialects are used extensively in
almost all everyday conversations and therefore have considerable importance. However,
owing to the lack of data and the poverty of resources, Natural Language Processing
(NLP) technology for dialectal Arabic is still in its infancy. Basic resources, tokenizers,
and morphological analyzers, which have been developed for Modern Standard Arabic
(MSA), are still virtually non-existent for the dialects.
1.3.1 Modern Standard Arabic Corpora
There are many research projects involved in the development of MSA corpora, such as
the updated version of the Open Source Arabic Corpora (OSAC) described in [Saad 2010],
which includes the following corpora: the British Broadcasting Corporation (BBC) Arabic
corpus collected from bbcarabic.com, the Cable News Network (CNN) Arabic corpus
collected from cnnarabic.com, and the Open Source Arabic Corpus (OSAc) collected from
multiple sites. The OSAC corpus contains about 23 MB of text after removing the stop
words.
The Foreign Broadcast Information Service (FBIS) corpus is another MSA corpus, cre-
ated by [Messaoudi 2005] and used in [Vergyri 2004]. The data set comprises a collection
of radio newscasts from various radio stations in the Arabic-speaking world (Cairo, Dam-
ascus, Baghdad), totalling approximately 40 hours of speech and roughly 240K words.
The transcription of the FBIS corpus was done in Arabic script only and does not contain
any diacritic information.
The Linguistic Data Consortium (LDC) provided the Penn Arabic Treebank
[Duh 2005], a data set of newswire text from Agence France Presse, An-Nahar News,
and Ummah Press transcribed in standard MSA script. It contains more than 113,500
tokens, which are analyzed and provided with disambiguated morphological information.
1.3.2 Dialectal Corpora
At present, the major standard dialect corpora are available through the LDC via the
DARPA EARS (Effective, Affordable, Reusable Speech-to-Text) program, which develops
robust speech recognition technology to address a range of languages and speaking styles
and which includes data from the Egyptian, Levantine, Gulf, and Iraqi dialects. The LDC
also provides conversational and broadcast speech with their transcripts.
Levantine Arabic (LA) is represented by the Levantine Arabic QT Training Data
Set [Maamouri 2006a], a set of phone conversations of Levantine Arabic speakers
[Duh 2005]. The data set contains approximately 250 hours of telephone conversations.
About 2,000 successful calls have been collected, distributed in terms of regional dialect
(Levantine, Egyptian, Gulf, Iraqi, Moroccan, Saudi, Yemeni). The data set provides
both the conversational speech and its transcripts.
Moreover, another Arabic colloquial corpus, CALLHOME Egyptian Arabic Speech
(ECA), used in [Duh 2005, Gibbon 1983], is dedicated to the Egyptian dialect. The
data set consists of 120 telephone conversations between native speakers of the Egyptian
dialect. The ECA corpus contains both dialectal and MSA word forms. ECA is also
accompanied by a lexicon containing the morphological analysis of all words, i.e. an
analysis in terms of stem and morphological characteristics such as person, number,
gender, POS, etc.
Last but not least, the Saudi Arabic dialect is represented by the Saudi Accented
Arabic Voice Bank (SAAVB) [Alghamdi 2008], which is very rich in terms of its speech
sound content and speaker diversity within Saudi Arabia. The total duration of the
recorded speech is 96.37 hours, distributed among 60,947 audio files. SAAVB was
externally validated and used by the IBM Egypt branch to train their speech recognition
engine.
The English-Iraqi corpus is another Arabic corpus, mentioned in [Precoda 2007], which
consists of 40 hours of transcribed speech audio from DARPA's Transtac program.
Many small corpora were developed to satisfy specific needs, as described
in [Al-Onaizan 1999, Ghazali 2002, Barkat-Defradas 2003], where ten speakers originally
from the eastern zone (Egyptian, Syrian, Lebanese and Jordanian Arabic) and from the
Moroccan Arabic area (Algeria and Morocco) listened to the story 'The North Wind and
the Sun' in French and spontaneously translated it into their dialects.
Some other research projects are limited to the collection of dialectal text data, for
example the Arabic Online Commentary Dataset (AOC) mentioned in [Zaidan 2011],
created by crawling the websites of three Arabic newspapers (Al-Ghad, Al-Riyadh,
Al-Youm Al-Sabe). The commentary data consists of 52.1M words and includes sentences
from articles in the 150K crawled webpages. In fact, 41% of the content contains dialectal
words.
In addition to the AOC, an Arabic-Dialect/English Parallel Text was developed by
Raytheon BBN (Bolt, Beranek and Newman) Technologies, the LDC, and Sakhr Software.
This corpus contains approximately 3.5 million tokens from Arabic dialect sentences
with their English translations. The data consist of Arabic web text that was filtered
automatically from large Arabic text corpora provided by the LDC.
1.4 Orthographic Transcription
The acoustic signal of audio content may correspond to speech, music or noise, but also
mixtures of speech, music and noise. In addition to that, there is a variety of speakers and
topics in the same record. Indeed, transcribers can work on a given subject successively
or simultaneously. The sound quality of the recording (fidelity) may vary significantly
over time.
The different stages of the transcription work are: the segmentation of the soundtrack,
the identification of turns and speakers, the orthographic transcription, and verification.
Depending on the transcriber's choice, these steps can be conducted in a parallel or
sequential manner over long portions of the signal.
The difficulty of transcribing depends on the number of speakers involved in the record-
ings and on the clarity of their pronunciation. Processing many files in quick succession
does not make the work faster, as exhaustion slows down the process; it is preferable
to take a rest between files. In [Al-Sulaiti 2004], the average time for transcribing
a five-minute Arabic spoken record by a non-professional typist, without including any
enriched orthographic annotation, is 1:50:42 (the average of the shortest and the longest
time taken in transcription).
The annotation step aims at structuring the recording, that is, segmenting it and
describing the acoustic signal at the different levels deemed relevant for further
processing. The transcription cannot perfectly reflect the audio record or the
pronunciation of a given subject or term, and it can serve as the basis for deeper
studies of semantics, syntax, and pronunciation.
Manual transcription of audio recordings, such as radio or television streams, will ad-
vance research in automatic transcription, indexing, and archiving. Indeed, the transcrip-
tion provides a linguistic resource and data that make possible the construction of an
automatic recognition system, which can then be used to produce automatic transcrip-
tions.
1.4.1 Transcription Software
There are different types of tools for the labelling and annotation of speech corpora.
Some of them address audio formats, such as Transcriber [Barras 2000], Praat
[D.Weenink 2013], SoundIndex (1), and AMADAT (2), and others address video formats,
for example Anvil [Kipp 2011] and the EUDICO Linguistic Annotator (ELAN)
[Dreuw 2008].
• Transcriber is free software that has been used in many projects such as
[Messaoudi 2005, Piu 2007, Fromont 2012]. It has become very popular due to its
simplicity and efficiency, as it makes transcribing and labelling easier.
• Praat is a productivity tool for phoneticians. It allows speech analysis, synthesis, la-
belling and segmentation, speech manipulation, statistics, and learning algorithms,
and it produces publication-quality graphics.
• SoundIndex is a tool that allows the user to write audio tags at any level in the
hierarchy of an XML file by setting values for attributes such as the start and the
end of the audio in the sound editor. The interpretation of the audio tags is written
in XSL.
(1) Software documentation on http://michel.jacobson.free.fr/soundIndex/Sommaire.htm.
(2) AMADAT User Guidelines on http://projects.ldc.upenn.edu/EARS/Arabic/EARS_AMADAT.htm.
• The Arabic Multi-Dialectal Transcription Tool (AMADAT) allows the transcription
of speech and offers a very helpful correction-level functionality.
• Anvil is a free video annotation tool. It offers frame-accurate, hierarchical, multi-
layered annotation driven by user-defined annotation schemes. The color-coded
annotation elements are displayed on multiple time-aligned tracks. Special features
include cross-level links, non-temporal objects, and a project tool for managing
multiple annotations. Anvil allows the import of data from the widely used, public
domain phonetic tools Praat and XWaves. Anvil's data files are XML-based.
• ELAN is an annotation tool that allows creating, editing, visualizing and search-
ing annotations for video and audio data. This software aims to provide a sound
technological basis for the annotation and exploitation of multimedia recordings.
In addition, although ELAN is specifically designed for the analysis of language,
sign language, and gesture, it can be used on any media corpora, with video and/or
audio data, for purposes of annotation, analysis and documentation.
1.4.2 Transcription Guidelines
The transcription process follows specific conventions to provide records structured by
thematic content, speakers, and other speech information. These tools produce informa-
tion called annotations. Nowadays, many conventions have been adopted in NLP projects
to satisfy the need for a homogeneous transcription manner and to provide annotation
enrichment. Generally, these conventions depend on the speech data format and on the
transcription tool used.
When a speech corpus is transcribed into written text, the transcriber is immediately
confronted with the following question: how should the reality of oral speech be reflected
in a corpus?
A set of rules for writing speech corpora is designed to provide an enriched ortho-
graphic transcription. These conventions establish the annotated phenomena. Numerous
studies have been carried out on prepared speech, for example on broadcast news
[Cam 2008].
However, conversational speech refers to a more informal activity, in which partici-
pants constantly manage the topic and in which speakers and speech turns, which
correspond to changes of speaker, must be identified [Gro 2007, Cam 2008, André 2008].
As a consequence, numerous phenomena appear, such as hesitations, repeats, feedback,
backchannels, etc. Other phonetic phenomena, such as non-standard elision, reduction
phenomena [Meunier 2011], truncated words and, more generally, non-standard pronun-
ciations, are also very frequent. All these phenomena can impact the phonetization.
Hence, identifying different types of pauses (long pauses between turn-taking and short
pauses between words) is very useful for further purposes such as the development of a
voice recognition system [Alotaibi 2010].
In [Gro 2007], the conventions focus on segment structures such as elongation, trun-
cation, aspiration, and sighs. Spontaneous oral production is a real problem in terms of
annotation because, according to [Shriberg 1994]:
Disfluencies show regularities in a variety of dimensions. These regularities
can help guide and constrain models of spoken language production. In addi-
tion they can be modeled in applications to improve the automatic processing
of spontaneous speech.
Another definition of disfluencies is given in [Piu 2007]:
Disfluencies (repeats, word-fragments, self-repairs, aborted constructs, etc)
inherent in any spontaneous speech production constitute a real difficulty in
terms of annotation. Indeed, the annotation of these phenomena seems not
easily automatizable, because their study needs an interpretative judgement.
In fact, there are different types of disfluencies, as described in [Piu 2007]:
• Repetition disfluency is among the most frequent types of disfluency in conversa-
tional speech (accounting for over 20% of disfluencies). According to [Cole 2005]:
Repetition disfluencies occur when the speaker makes a premature com-
mitment to the production of a constituent, perhaps as a strategy for
holding the floor, and then hesitates while the appropriate phonetic plan
is formed.
• Self-correction, as described in [Kurdi 2003], is the substitution of one word or series
of words for another, in order to modify or correct a part of the statement.
• [Pallaud 2002] highlighted primers, i.e. word fragments whose enunciation is inter-
rupted. Generally, disfluencies can combine at least two of the phenomena mentioned
above.
In [Dipper 2009]:
Transcription guidelines specify how to transcribe letters that do not have a
modern equivalent. They also specify which letter forms represent variants
of one and the same character, and which letters are to be transcribed as
different characters.
In [Cam 2008] there are numerous transcription rules related to the speech text, such
as how to write letters, punctuation, numbers, Internet addresses, acronyms, spelling,
abbreviations, hesitations, repetitions, truncations, and absent or unknown words. A
list of markups is also used to identify noise, pronunciation problems, backchannels, and
comments.
In [André 2008] the specific pronunciations were recorded with the SAMPA phonetic
alphabet. General rules for transcribing short vowels, spelling, and diacritics are
presented in the LDC Guidelines for Transcribing Levantine Arabic (3).
1.5 Conclusion
Despite all the attempts made by the LDC and other research projects to provide
speech corpora for Arabic dialects, some varieties, like the Tunisian Dialect (TD), still
need more work on corpus construction. These attempts have faced several challenges,
some related to the Arabic language and others to general NLP issues. Moreover, some
problems become noticeable during speech transcription, such as the ambiguity of a
word's transcription. Another problem occurs when multiple sound sources are present;
it is then necessary to focus on the most prominent source. Likewise, when two speakers
are talking in the foreground, both can be transcribed through the mechanism of
superimposed speech.
(3) The Guidelines are available on the LDC website: http://ldc.upenn.edu/Projects/EARS/Arabic/www.Guidelines_Levantine_MSA.htm
Chapter 2
Morpho-Syntactic Annotation
2.1 Introduction
A transcript can be annotated by adding linguistic information for each word. In
[Sawaf 2010], linguistic annotation is defined as follows: corpus annotation is the practice
of adding interpretative, especially linguistic, information to a text corpus, by coding
added to the electronic representation of the text itself. Grammatical tagging is then the
task of associating a label, or tag, with each word in the text to indicate its grammatical
classification.
Generally speaking, the morpho-syntactic annotation process relies on word structure,
as described in [Al-Taani 2009]. Accordingly, patterns and affixes are used to determine
the grammatical class of a word following defined rules. Moreover, this process differs
according to the data and resources used. Therefore, the approaches used in the
morpho-syntactic annotation process vary from supervised to unsupervised, as described
in [Jurafsky 2008].
In the following sections we introduce morpho-syntactic annotation methods for the
Modern Standard Arabic (MSA) language and for dialectal Arabic, including the
Tunisian Dialect (TD).
2.2 Morpho-syntactic Annotation Methods for MSA
Language
The morpho-syntactic annotation process for MSA has been performed using dif-
ferent approaches, such as statistical approaches [Al-Shamsi 2006] and learning
approaches [Bosch 2005]. Recently, some works have combined approaches to improve
the performance of the developed tagger. In the following, we describe some of these
works.
The statistical approach was used in the work of [Al-Shamsi 2006] to handle the POS
tagging of Arabic text. The developed method was based on an HMM and followed these
steps:
1. Creation of a set of tags,
2. Employment of Buckwalter's stemmer to stem Arabic text from the used corpus
which contains 9.15 MB of native Arabic articles,
3. Manual correction of tagging errors,
4. Design and construction of an HMM-based model of Arabic POS tags,
5. Training of the developed POS tagger on the annotated corpus.
The proposed method achieved an F-measure score of 97%.
Another tagging system for Arabic POS tagging was proposed by [Hadj 2009]. The POS
tagging task of the system is based on the sentence structure and combines morphological
analysis with an HMM:
- The morphological analysis aims to reduce the size of the tag lexicon by seg-
menting words into their prefixes, stems, and suffixes.
- The HMM is used to represent the sentence structure in order to take into account
the logical linguistic sequencing.
Each possible state of the HMM represents a tag. The transitions between those states
are governed by the syntax of the sentence.
The training corpus is composed of old texts extracted from books of the third
century. These data were manually tagged using the developed tagset. The system
evaluation was based on the same corpus. The obtained result reaches a recognition rate
of 96%, which is considered very promising given the size of the tagged data.
In addition to the statistical approaches, recent applications tend to explore the use of
machine learning methods to handle Arabic morphology and the POS tagging process.
Indeed, a memory-based learning approach was developed by [Bosch 2005] for the
morphological analysis and part-of-speech tagging of written Arabic. The learning
classification task in memory-based learning was performed by employing the k-
nearest neighbor classifier, which searches for the k nearest neighbors. Memory-based
learning, a supervised inductive learning algorithm, treats a set of labeled training
instances as points in a multi-dimensional feature space and stores these instances as
such in an instance base in memory. Furthermore, [Bosch 2005] employed a modified
value difference metric (MVDM) distance function to determine the similarity of pairs
of values of a feature. The metric uses the conditional probabilities of the two values,
conditioned on the classes, to determine their similarity.
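To make the metric concrete, the following minimal sketch computes an MVDM distance between two feature values from their class-conditional probabilities. It is not Bosch's implementation; the Buckwalter-style feature values and the tag inventory are invented purely for illustration.

```python
from collections import Counter, defaultdict

def mvdm(value1, value2, value_class_counts):
    """Modified Value Difference Metric between two feature values.

    value_class_counts maps a feature value to a Counter of the classes seen
    with that value in training; the distance sums the differences of the
    class-conditional probabilities P(class | value).
    """
    c1, c2 = value_class_counts[value1], value_class_counts[value2]
    n1, n2 = sum(c1.values()) or 1, sum(c2.values()) or 1
    return sum(abs(c1[c] / n1 - c2[c] / n2) for c in set(c1) | set(c2))

# Toy training data: (feature value, class) pairs, invented for the example.
training = [("Al", "DET"), ("Al", "DET"), ("Al", "NOUN"),
            ("wa", "CONJ"), ("wa", "CONJ"), ("ktAb", "NOUN")]
counts = defaultdict(Counter)
for value, label in training:
    counts[value][label] += 1

print(mvdm("Al", "wa", counts))    # 2.0   -> very different class profiles
print(mvdm("Al", "ktAb", counts))  # ~1.33 -> the profiles overlap on NOUN
```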
In order to train and test the developed approach, [Bosch 2005] exploited the Arabic
Treebank 1 (version 2.0) corpus, which consists of 166,068 tagged words. The evaluation
of the morphological analyzer was based on predicting the part-of-speech tags of the
segments, the positions of the segmentations, and all letter transformations between the
surface form and the analysis. The obtained results in terms of precision, recall, and F-
score are respectively 0.41, 0.43, and 0.42. The POS tagger attained an accuracy of
66.4% on unknown words and 91.5% on all words in held-out data.
Furthermore, combining the morpho-syntactic analysis generated from the morpholog-
ical analyzer and the part-of-speech predicted by the tagger yields a joint accuracy of
58.1%. This accuracy represents the correctly predicted tags and corresponds to the full
analysis for unknown words. The main limitation of the memory-based learning approach,
as concluded in [Bosch 2005], was its inability to recognize the stem of an unknown word
and accordingly the appropriate vowel insertions.
Another approach, combining statistical and rule-based techniques, was introduced by
[Khoja 2001] to construct an Arabic part-of-speech tagger. First, the developed
approach is based on the use of traditional Arabic grammatical theory to determine the
rules applied while stemming a word; these rules are used to find the stem or root by
removing affixes (prefixes, suffixes and infixes). Second, the approach uses lexical and
contextual probabilities. The lexical probability is the probability of a word having a
certain grammatical class, whereas the contextual probability is the probability of one
tag following another tag. These probabilities are calculated from the tagged training
corpus.
The method consists in searching for a word in the lexicon to determine its possible
tags. Words not found in the lexicon are then stemmed, using combinations of affixes
to determine the tag of the word. Finally, in order to disambiguate ambiguous words
and unknown words, [Khoja 2001] used a statistical tagger based on the Viterbi
algorithm [Jelinek 1976].
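As an illustration of how lexical and contextual probabilities combine in such a tagger, here is a minimal bigram Viterbi decoder. The tag set, the probability values and the Buckwalter-style words are invented; this is a sketch of the general technique, not Khoja's tagger.

```python
def viterbi(words, tags, lexical_p, contextual_p, start_p):
    """Bigram Viterbi decoding with lexical P(word|tag) and contextual P(tag|prev tag)."""
    # Each cell holds (best score, best tag sequence ending in this tag).
    column = {t: (start_p.get(t, 1e-9) * lexical_p[t].get(words[0], 1e-9), [t]) for t in tags}
    for w in words[1:]:
        new_column = {}
        for t in tags:
            prev = max(tags, key=lambda p: column[p][0] * contextual_p[p].get(t, 1e-9))
            score = column[prev][0] * contextual_p[prev].get(t, 1e-9) * lexical_p[t].get(w, 1e-9)
            new_column[t] = (score, column[prev][1] + [t])
        column = new_column
    return max(column.values())[1]

# Invented toy model: a particle, a verb and a noun.
tags = ["NOUN", "VERB", "PART"]
lexical_p = {"NOUN": {"ktAb": 0.8}, "VERB": {"ktb": 0.7}, "PART": {"w": 0.9}}
contextual_p = {"NOUN": {"NOUN": 0.4, "VERB": 0.3, "PART": 0.3},
                "VERB": {"NOUN": 0.6, "VERB": 0.1, "PART": 0.3},
                "PART": {"NOUN": 0.5, "VERB": 0.4, "PART": 0.1}}
start_p = {"NOUN": 0.4, "VERB": 0.4, "PART": 0.2}

print(viterbi(["w", "ktb", "ktAb"], tags, lexical_p, contextual_p, start_p))
# -> ['PART', 'VERB', 'NOUN']
```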
In order to train the tagger and to construct the lexicon, [Khoja 2001] used a manually
tagged corpus of 50,000 Modern Standard Arabic words extracted from the Saudi
Al-Jazirah newspaper. The constructed lexicon contains 9,986 words. To test the
developed tagger, four corpora (85,159 words) were collected from newspapers and from
papers in social science. In addition to MSA words, the test corpus contained some
colloquial words. The statistical tagger achieved an accuracy of around 90% when
disambiguating ambiguous words.
Furthermore, [Khoja 2001] used an Arabic dictionary (4,748 roots) to test the developed
stemmer, obtaining an accuracy of 97%. Since the unanalyzed words are generally
foreign terms, proper nouns, or incorrectly written words, [Khoja 2001] concludes that
employing a pre-processing component could solve the problem.
2.3 Morpho-syntactic Annotation Methods for Dialectal Arabic
[Maamouri 2006b] describes a supervised approach used to annotate dialectal Arabic
data. A word list from the Levantine Arabic Treebank (LATB) data was used to
manually annotate the most frequent surface forms. Pattern-matching operations were
then performed to identify potential new prefix-stem-suffix combinations among the
remaining unannotated words in the list.
The Morphological/Part-of-Speech/Gloss (MPG) tagging included morphological
analysis, POS tagging, and glossing.
The developed system was evaluated before and after the use of a dictionary. The
evaluation shows a reduction of more than 10% in annotation errors.
Another supervised approach, described in [Duh 2005], was designed for tagging dialectal
Arabic using the ECA data. The developed system is based on a statistical trigram tagger
in the form of an HMM used as a baseline POS tagger. Statistical modeling and cross-
dialectal data-sharing techniques were used to enhance the performance of the baseline
tagger. The adopted approach requires only raw text data from several varieties of
Arabic and a morphological analyzer for MSA, so no dialect-specific tools were used.
To evaluate the developed system, [Duh 2005] compared the obtained results with
those obtained when using:
- a supervised tagger trained on hand-annotated data;
- a state-of-the-art MSA tagger applied to Egyptian Arabic.
As a result, there is a 10% improvement of the ECA tagger.
In addition to the supervised approaches, other projects tend toward the use of un-
supervised approaches, for example [Chiang 2006].
An Arabic dialect parser is described in [Chiang 2006], where three frameworks were
constructed for leveraging MSA corpora in order to parse LA. This process was based
on knowledge about the lexical, morphological, and syntactic differences between MSA
and LA.
[Chiang 2006] evaluated three methods:
• Sentence transduction : in which the LA sentence to be parsed is turned into an
MSA sentence and then parsed with an MSA parser;
• Treebank transduction : in which the MSA treebank is turned into an LA treebank;
• Grammar transduction : in which an MSA grammar is turned into an LA grammar
which is then used for parsing LA.
The MSA treebank data used comprise 17,617 sentences and 588,244 tokens. They
include four different lexicons: a small lexicon with uniform probabilities, a small
lexicon with EM-based probabilities, a big lexicon with uniform probabilities, and a big
lexicon with EM-based probabilities.
To evaluate the developed parser, [Chiang 2006] used data comprising 10% of the MSA
treebanks plus 2,051 sentences and 10,644 tokens from the Levantine treebank LATB.
The major limitation of this method, as concluded in [Chiang 2006], is the lack of a
demonstration of cost-effectiveness.
Another approach, described in [Al-Sabbagh 2012], uses a function-based annotation
scheme in which words are annotated based on their grammatical functions; the
morpho-syntactic structure and the grammatical function of a word can differ from each
other. The developed method is based on an implementation of Brill's Transformation-
Based POS tagging algorithm.
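A transformation-based tagger starts from a baseline tagging and repeatedly applies corrective rules learned from an annotated corpus. The sketch below only shows the rule-application step with one invented rule and an invented tag set; it is not Al-Sabbagh's tagger.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """Transformation: change from_tag to to_tag when the previous tag matches prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str

def apply_rules(tags, rules):
    """Apply Brill-style transformation rules to an initial tag sequence, one pass each."""
    tags = list(tags)
    for rule in rules:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags

# Baseline output for three tokens, then one corrective rule (both invented).
initial = ["PART", "NOUN", "NOUN"]
rules = [Rule(from_tag="NOUN", to_tag="VERB", prev_tag="PART")]
print(apply_rules(initial, rules))  # -> ['PART', 'VERB', 'NOUN']
```

During training, such rules are selected greedily according to how much each reduces the tagging errors on the annotated corpus.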
The developed tagger was trained on a manually annotated Twitter-based Egyptian Ara-
bic corpus composed of 22,834 tweets and containing 423,691 tokens. To evaluate the
developed POS tagger, a ten-fold cross-validation was performed. The obtained results
in terms of F-measure are 87.6% for the task of POS tagging without semantic feature
labeling and 82.3% for the task of POS tagging with tokenization and semantic features.
The main problem faced during analysis relates to three-letter and two-letter words,
which are highly ambiguous and can have multiple readings depending on the short
vowel pattern.
Example: the word جد (jadd) can be analyzed by the tagger as a:
Noun: meaning grandfather or seriousness.
Adverb: meaning seriously.
To solve this problem, [Al-Sabbagh 2012] concluded that a word sense disambiguation
module is fundamental to improving performance on highly ambiguous words.
2.4 Conclusion
In this chapter, we introduced methods of morpho-syntactic annotation for MSA and
dialectal Arabic. The choice of approach for the morpho-syntactic annotation task, such
as a statistical or a learning approach, depends on the available resources. Indeed, the
unsupervised techniques are not suitable for poorly resourced languages such as the
dialects. Therefore, the POS tagging process for colloquial Arabic still needs more
improvement in terms of corpus collection and annotation.
Part II
Proposed Method
Chapter 3
Data Collection and Transcription
3.1 Introduction
The transcription process consists of two basic steps. The first one provides the voice
data to be transcribed later. The second step consists in transcribing the voice data by
following directives that we have established. Indeed, to allow a better representation of
spontaneous speech phenomena, these directives take into consideration the specificities
of TD transcription. More details about these two steps are given in the following
sections.
3.2 Speech Collection
The aim of this step is to provide speech data, which is the first stage in corpus
creation. The choice of the content and type of the speech data is very important and
can be the key to further uses of our corpus. We therefore chose to provide both audio
and video speech to widen the use of our corpus in new research directions, especially
video annotation [Kipp 2011].
Furthermore, including different TD varieties (Sfaxian dialect, Sahel dialect, etc.)
improves the representativeness of the TD in our corpus. To provide speech data, we
used broadcast conversational speech as the main source of our corpus, as in the two
projects [Lamel 2007] and [Belgacem 2010]. These streams are generally radio and
television talk shows, debates, and interactive programs where the general public is
invited to participate in the discussion by telephone.
In general, the common conversational dialect in Tunisia is the dialect of the capital,
the one used on national TV and radio stations and by the majority of educated people.
Consequently, we have allocated the largest part of our corpus to this dialect.
Providing speech data with a variety of themes increases the size of the vocabulary in
our corpus and will be very useful for further applications, for example theme classifica-
tion [Bischo 2009]. We defined the following list of themes for our data selection:
Religious, Political, Cooking, Health, and Social. The latter can include record-
ings that refer to more than one theme. We also define the Other tag to mark other
types of themes.
Figure 3.1 shows the proportion of each theme in our corpus.
Figure 3.1: The proportion of themes in the corpus
Having a good amount of spoken recordings is fundamental in the design of the corpus.
A high sound quality is also required and will be useful for future processing, for
example in a voice recognition system. In addition, we included both single-speaker and
multi-speaker recordings in our collection to capture different aspects of conversational
speech. Table 3.1 gives a description of each transcribed file in our corpus. The
transcribed files reach a total duration of 1 hour, 25 minutes and 37 seconds.
The collected data files generally have a long duration that exceeds fifteen minutes; to
simplify the transcription task, we split these recordings in order to obtain sequences with
Table 3.1: Corpus files content
File Duration (sec) Number of Speakers Size (megabyte) Type
01 758 1 12.1 Video
02 867 3 13.9 Video
03 216 12 19.1 Audio
04 232 13 20.5 Audio
05 204 9 18.0 Audio
06 360 8 31.8 Audio
07 900 2 14.4 Video
08 364 3 5.8 Video
09 477 2 7.6 Video
10 759 2 12.2 Video
a duration between five and fifteen minutes. We then convert them to the MP3 format
so that they match the input expected by the transcription software.
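The thesis does not specify which tool was used for splitting and conversion; the following sketch shows one possible way to do it with the ffmpeg command-line tool called from Python. The file names are placeholders.

```python
import subprocess

def split_and_convert(source, out_prefix, segment_seconds=600):
    """Split a long recording into fixed-length chunks and encode them as MP3.

    Requires ffmpeg to be installed; the segment length defaults to 10 minutes,
    inside the 5-15 minute range targeted above.
    """
    subprocess.run(
        ["ffmpeg", "-i", source,
         "-f", "segment", "-segment_time", str(segment_seconds),
         "-vn",                     # drop any video track
         "-acodec", "libmp3lame",   # MP3 audio encoding
         f"{out_prefix}_%03d.mp3"],
        check=True,
    )

# Hypothetical usage on one of the collected video files.
split_and_convert("show_01.avi", "show_01_part")
```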
In order to provide more details about our speech data collection, we created the
references.xml file, which contains a description of each file in our corpus. XML is
used to represent the data structures in this file. Such a representation allows a simple
preview for users and an easier integration into future annotation systems. In addition,
we wrote a DTD file, named references.dtd, to validate this XML file.
Table 3.2 describes all the different tags used in the references.xml file; a small sketch
of such a file follows the table.
Table 3.2: Description of tags used in the references XML file
Label    Type     Name                       Unit      Description
ID       Integer  Identifier                 -         Identifier of a record
NAM      String   Name                       -         Name of a record
DUR      Integer  Duration                   sec       Duration of a record
TOP      List     Topic                      -         Topic of a record
NBMASP   Integer  Number of male speakers    -         Number of male speakers in a record
NBFESP   Integer  Number of female speakers  -         Number of female speakers in a record
FISI     Float    File Size                  megabyte  File size of a record
SOFITY   List     Source File Type           -         Source file type of a record (TV or Radio)
SONAM    String   Source Name                -         Source name of a record
SOFIEX   String   Source File Extension      -         Source file extension of a record
SOFISI   Float    Source File Size           megabyte  Source file size of a given record
SOTY     List     Source Type                -         Source type of a given record (Audio or Video)
SODA     Date     Source Date                -         Source date of a record
SOLI     String   Source Link                -         Source link of a record
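As an illustration only, the sketch below builds one record entry with Python's xml.etree using the tag names of Table 3.2. The element nesting, the root element name and the sample values are assumptions; the real references.xml and its references.dtd are defined by the corpus itself.

```python
import xml.etree.ElementTree as ET

TAGS = ("ID", "NAM", "DUR", "TOP", "NBMASP", "NBFESP", "FISI",
        "SOFITY", "SONAM", "SOFIEX", "SOFISI", "SOTY", "SODA", "SOLI")

def make_record(values):
    """Build one record element holding the fields listed in Table 3.2."""
    record = ET.Element("RECORD")
    for tag in TAGS:
        ET.SubElement(record, tag).text = str(values[tag])
    return record

root = ET.Element("REFERENCES")
root.append(make_record({
    "ID": 1, "NAM": "01", "DUR": 758, "TOP": "Social",
    "NBMASP": 1, "NBFESP": 0, "FISI": 12.1,
    "SOFITY": "TV", "SONAM": "example-channel", "SOFIEX": "avi",
    "SOFISI": 120.0, "SOTY": "Video", "SODA": "2013-01-01",
    "SOLI": "http://example.org/stream",
}))
ET.ElementTree(root).write("references.xml", encoding="utf-8", xml_declaration=True)
```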
3.3 Transcription Process
The speech annotation process includes the segmentation of the soundtrack, the iden-
tification of turns and speakers, and the orthographic transcription. We applied these
steps in a parallel manner, taking into consideration the notes in the Orthographic
Transcription of Tunisian Arabic [Zribi 2013] and in the Directive of Transcription and
Annotation of Tunisian Dialects.
The first voice file (duration 12:38) was initially transcribed with the Speech
Assessment Methods Phonetic Alphabet (SAMPA) for Arabic. We thought that using
SAMPA for Arabic would allow a better representation of phonetics. However, during
the transcription process we found it better to use the Arabic script, including diacritics,
instead of SAMPA for Arabic, so we adopted the Arabic script for the rest of the
transcription process.
Transcribing Arabic spoken recordings is a very long task, especially when using Arabic
script and diacritics. For example, a recording that lasts 3 minutes and 52 seconds took
more than 4 hours. In practice, transcribing each minute takes at least one hour: 15-
20 minutes for the identification of turns and speakers and 40-45 minutes for transcribing,
following the rules in our directives.
3.3.1 Transcription Tools
There is a variety of transcribing tools (SoundIndex, AMADAT, XWaves, etc) for voice
data. We selected Transcriber [Barras 2000] and Praat [D.Weenink 2013] to handle the
transcribing task. The Transcriber software was adopted for the following advantages:
Simple user interface:
• Supports many languages (including English, French and Arabic).
• Easy manipulation of voice intervals.
• Supports the use of keyboard shortcuts in annotation.
Very rich in terms of annotation:
• Defined annotation events (noise list, lexical list, named entities list, etc).
• Possibility to edit or add additional annotations.
Input file flexibility:
• Accepts speech files of long duration.
• Supports various file formats (au, wav, snd, mp3, ogg, etc).
Better output representation:
• Supports many types of encoding (UTF-8, ISO-8859-6, etc).
• The output file follows a description schema.
Concerning the Praat software, the choice was motivated by the needs of our ANLP
research group. In addition, Praat gives a better representation of speech overlaps and
allows speech analysis.
3.3.2 Transcribing Guidelines
Transcription guidelines are meant to be followed during the annotation process of our
speech data. We elaborated the Orthographic Transcription of Tunisian Arabic
directives [Zribi 2013], which are adapted from the Enriched Orthographic Transcription
(TOE in French) [Bigi 2012]. To deal with the TD, some rules have been modified or
removed. As the standard orthographic transcription does not take into consideration
the observed phenomena of speech (elisions, disfluencies, liaisons, noise, etc.), we enriched
our directives with them.
The Directive of Transcription and Annotation of Tunisian Dialects was written to give
additional phonemic and phonetic annotations for the speech data. Examples were
included to show the application of these rules in the Transcriber software. The directive
was adapted from the ESTER2 convention [Cam 2008] and takes into consideration the
specificities of the Arabic language and of the TD.
The following is a description of some conventions:
a) The identification of turns and speakers
- Sections:
We defined two kinds of sections in the audio document: the relevant sections,
identified by the title report, and the non-transcribed sections, identified by the
title nontrans. The non-transcribed sections last more than fifteen seconds and
contain, for example:
• Advertising, weather reports, programme jingles,
• Applause,
• Music, songs,
• The beginning or the end of another show different from the current programme,
• Silence.
The other sections are relevant, so they are the only sections that we segment and
transcribe, as illustrated by the example in Figure 3.2. The concept of sections is
absent from the Praat software, so we do not take these rules into consideration
while using it.
- Turn-taking
First, we identify each speaker involved in the audio document.
There are two types of speakers:
Figure 3.2: Sections and Turns in Transcriber
• Global (1): speakers identified by the syntax First Last name.
• Local (2): speakers identified by the syntax First Last name if possible.
Otherwise, we denote them by the syntax speaker # n, where n is a number
from 1 to n corresponding to the speaker order.
The same speaker must always appear with the same identifier, and the list of
speakers must contain only speakers involved in the audio document. In addition,
we fill in all the information relative to these speakers, such as gender and dialect.
Second, we attribute the name of the speaker to the speech turn. Speech turns
that do not contain any speaker speech are identified by the syntax no speaker.
The example in Figure 3.3 illustrates how to manage speakers in the Transcriber
software.
To solve the problem of speech overlapping, we adapted the solution mentioned in
[Barras 2000], where we create a new speaker named with the syntax First
Last name speaker 1 + First Last name speaker 2.
The Praat software gives a better representation of speech overlapping. In fact, the
(1) This type is common to several audio speakers, such as the presenter, journalists, etc.
(2) These are unknown speakers who intervene by telephone, for example.
Figure 3.3: Manage speakers in Transcriber
script of each speaker is presented separately in an individual interval as shown in
Figure 3.4.
- Silence:
Silence can occur at the beginning of a speaker turn, mixed with the transcript, or
at the end of a turn. To handle it, we isolate silence and noise longer than 0.5
seconds in a no speaker turn. Silence longer than 0.2 seconds at the beginning of a
speaker turn is isolated in a no speaker segment or integrated directly into the
previous speech turn. Likewise, silence longer than 0.2 seconds at the end of a
speaker segment is isolated in a no speaker segment or integrated into the upcoming
no speaker turn. Finally, we add the hashtag symbol # when there is a silence
between 0.1 and 0.2 seconds inside a relevant turn (these thresholds are sketched in
code after this subsection).
- Segments:
A relevant segment contains an intervention of a speaker and must have a minimum
of syntactic and semantic consistency. If a segment of a speech turn exceeds fifteen
seconds, we redistribute it into relevant segments.
Figure 3.4: Manage speakers in Praat
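The pause thresholds above can be summarized by the small decision function below. It is only an illustration of the rules; in practice the segmentation decisions are made manually in Transcriber or Praat.

```python
def silence_action(duration, position):
    """Return the action our directives prescribe for a silence of given length.

    duration is in seconds; position is "start", "end" or "within" a speaker turn.
    """
    if duration > 0.5:
        return "isolate the silence/noise in a 'no speaker' turn"
    if position in ("start", "end") and duration > 0.2:
        return "isolate in a 'no speaker' segment or merge with the adjacent turn"
    if position == "within" and 0.1 <= duration <= 0.2:
        return "mark the silence with # inside the relevant turn"
    return "keep the silence inside the current segment"

print(silence_action(0.15, "within"))  # -> mark the silence with # ...
print(silence_action(0.30, "start"))   # -> isolate or merge with the adjacent turn
```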
b) Orthographic transcription
To transcribe the TD, we use the Modern Standard Arabic transcription rules that
do not affect the characteristics of the dialect. We also define a set of rules that allows
the transcription of Tunisian Arabic words based on their phonology.
- Transcription of the Hamza:
Transcribe the Hamza only if it is pronounced, using a single one of the standard
Hamza forms. If the absence of the Hamza in the dialect word causes ambiguity,
it should be transcribed.
- Transcription of ta marbuta:
The ta marbuta (ة) should be written at the end of the word whether it is pro-
nounced /a/ or /t/.
Example: تفاحة (an apple); in a construct such as 'the child's book', the ta marbuta
is still written even though it is pronounced /t/.
- Code switching:
MSA, TD and foreign languages coexist in the daily speech of Tunisian people.
The transcription of MSA words should respect the transcription conventions
of the Arabic language.
Foreign words and MSA words should be written respectively using the forms
[lan:X, text or word, SAMPA pronunciation] and [lan:MSA, text or word]. We use
the SAMPA of the specific language for writing the speaker's pronunciation.
Example:
• MSA word: [lan:MSA, word]
• French word: [lan:Fr, informatique, ?anfurmati:ku]
• English word: [lan:En, network, na:twirk]
- Atypical agreement:
We keep the standard orthographic spelling of the words but transcribe them with
the agreement as it was actually said.
Example: a phrase meaning 'the holidays are wonderful' is written with the agree-
ment produced by the speaker, not with the standard agreement.
- Personal pronouns:
Pronouns must be transcribed as found in this list:
أنا (I), إنتي (you, sg.), إنتم / إنتوما (you, pl.), نحنا / أحنا (we), هو (he), هي (she), هوما (they).
- Names of months and days:
The names of the months and days must be transcribed as in the MSA language.
- Affixes and clitics:
Table 3.3 lists the dialectal clitics.
Note: the definite article ال should always be written in full, even when only /l/ is
pronounced. This rule also applies to transcribing words which start with a sun or
moon letter.
Table 3.3: Clitics in the TD language
Pronominal enclitics: ك, ه, و, ها, هم, كم, نا, ي
Negation enclitic: ش
Interrogation enclitic: شي
Proclitics: و, ل, ب, ك, ع, ال, م
- Named entities:
Named entities are annotated with dedicated representations: person names are
marked with the tag PERSON NAME and place names with the tag Place name.
- Characters:
The phonemes /v/, /g/ and /p/ do not exist in the Arabic language. To transcribe
them, we add an apostrophe ( ' ) after the corresponding letters.
- Incorrect word:
When the speaker replaces a letter with an incorrect one, we keep the original letter
and add the corresponding correct one to it. These corrections are represented as in
the following pattern, where x stands for the other letters of the word:
x{Correct letter, Original letter}x
c) Rules of marking
We transcribe what is heard (hesitations, repetitions, onomatopoeia, etc.); the transcript should stay close to the signal.
- Noise:
We insert the tag [i] to indicate a breathing-in and [e] a breathing-out of the speaker. The tag [b] indicates a noise:
• mouth noises (cough, throat noise, laughter, kisses, whisper, etc.),
• rustling of papers,
• microphone noise.
The tag [musique] indicates music.
- Punctuation:
We punctuate the text using only these punctuation marks: ., ! and ?.
3.4 Conclusion
A mixture of streamed TV and radio station programs has been collected and adapted following the speech data collection process described in this chapter. As a result, 10 files totaling more than 1 hour and 25 minutes of speech were transcribed following our transcription guidelines. The transcription process was the most laborious stage of the project. During this process we faced some problems with the transcription tools, such as the slowness of the Praat interface when transcribing long recordings and the incorrect updating of Arabic script in the Transcriber interface.
Chapter 4
Morpho-syntactic Annotation Method
4.1 Introduction
Speech and text resources for the Tunisian dialect are very rare, which is an obstacle to developing applications in this field. In this context, our project contributes to providing resources by constructing a morpho-syntactically annotated speech corpus for TD.
In this chapter, we focus on the different phases of the annotation task, which aims to identify the grammatical class of each word. Our method integrates different tools and resources to annotate TD words. Indeed, we chose two morphological analyzers (MADA and the Al-Khalil TD version analyzer) to analyze MSA words, and we use the Al-Khalil TD version and a TD dictionary to analyze TD words. The first section presents a global view of our method and the different steps followed to achieve the morpho-syntactic annotation task. Then, we describe each step of this process in detail.
4.2 Our Main Method
In our method, the morpho-syntactic annotation process follows the steps described in Figure 4.1. The process starts by extracting the speakers' text and some useful word annotations (word language, named entity, etc.) from the transcription file. These pieces of information are then saved in another file with a specific structure to be used in the next, analyzing, step.
Then, we use two morphological analyzers (MADA and the Al-Khalil TD version) and a dictionary, depending on the characteristics of each word (word language, onomatopoeia, etc.) and applying the rules we established, to determine the suitable grammatical class for each word. Finally, after the user confirms the analysis, we save the result of the morpho-syntactic annotation in an XML file respecting a specific structure. As a result, each word is assigned a tag that indicates its grammatical class. More details about these steps are given in the next sections.
Figure 4.1: Morpho-syntactic annotation steps in our method (Transcription File → Preliminary Step → Segmented File → Word Analysis → Choosing Result → Result File Generation → Annotated File; the Word Analysis step relies on the Tunisian Dialect Dictionary, the MADA analyzer and the Al-Khalil TD version analyzer)
4.3 Preliminary Step
The purpose of the preliminary step is to import the speaker information and their speech text with all the useful annotations (named entities, word language, etc.). Indeed, these annotations vary depending on the transcription tool used (Transcriber or Praat). Table 4.1 lists the information extracted from the annotations in the transcription files.
The information extraction process is summarized in the following steps (a minimal sketch of step 3 follows the list).
1. Collect speech text: the speech text of each speaker is divided into many speech turns, so we gather them into a single text per speaker.
2. Clean up the speech text: the speech text includes many annotations; some of them are very useful for the morpho-syntactic annotation process, while others, such as noise and music, are removed because they are not useful for this task.
3. Split the speech text into sentences: using the punctuation annotations (`!', `.', `?'), we divide the speech text into a list of sentences.
4. Extract word annotations: the word annotations (pronunciation notation, disfluencies, named entities, word language) described in Table 4.1 are extracted from the transcription file and later used in the morpho-syntactic analysis process.
5. Generate the segmented text file: after extracting the useful annotations from the transcription file, we generate a structured file to be used in the next step.
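A minimal Java sketch of the sentence-splitting step (step 3), assuming the speech text has already been gathered per speaker and cleaned; the class and method names are illustrative, not part of the actual TDAT code:

import java.util.ArrayList;
import java.util.List;

public class SentenceSplitter {

    // Splits a speaker's gathered speech text into sentences on the marks ., ! and ?.
    public static List<String> split(String speakerText) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : speakerText.toCharArray()) {
            current.append(c);
            if (c == '.' || c == '!' || c == '?') {
                String sentence = current.toString().trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                current.setLength(0);
            }
        }
        String rest = current.toString().trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);   // keep trailing text that has no final punctuation
        }
        return sentences;
    }
}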
Table 4.2 describes each tag used in the segmented text file.
Table 4.2: Description of tags used in the segmented text file
- Text (Text): the text of a transcribed file.
- sp (Speaker speech): the text of each speaker in a transcribed file.
- s (Sentence): a sentence from a speaker's speech.
- w (Word): a word from a sentence.
- ponct (Punctuation): the punctuation of a given sentence; one of the three values ., ! or ?.
  Example: <s id="sID1" ponct=".">
- id (Identifier): in the Text element, the id attribute identifies the file and takes the file name as its value. For the other elements, id identifies each element following a specific codification: element tag + "ID" + number.
  Examples: <w id="wID10" ...>, <s id="sID9" ...>
- na (Named entity): the attribute na identifies words that are part of a person or place name.
  Example: <w id="wID1613" na="B_N"><Value>...</Value></w> followed by <w id="wID1614" na="I_N"><Value>...</Value></w>
- elis (Elision): the attribute elis identifies a word containing an elision; it holds the word with the elided part between parentheses, while the Value element contains the correct word.
  Example: <w id="wID30" elis="..."><Value>...</Value></w>
- wLang (Word language): the default language of words in the transcribed text is TD; all other languages are considered foreign. Three values are possible for a foreign language: Fr (French), En (English) and MSA (Modern Standard Arabic).
- wTrans (Word transliteration): the transliteration of a foreign word, written in SAMPA pronunciation.
  Example: <w id="wID1612" wLang="fr" wTrans="..."><Value>professeur</Value></w>
- hesi (Hesitation): the attribute hesi identifies a hesitation word.
  Example: <w id="wID1980" hesi="..."/>
- onom (Onomatopoeia): the onom attribute identifies an onomatopoeia word.
  Example: <w id="wID41" onom="..."/>
Figure 4.2 presents an example of a segmented file.
Figure 4.2: Example of the Segmented Text
<?xml version='1.0' encoding='UTF-8'?>
<Text id="012">
  <sp id="1">
    <s id="1" ponct=".">
      <w id="1" wLang="Fr" wTrans="madame">
        <Value>madame</Value>
      </w>
      <w id="2" na="S_N">
        <Value>...</Value>
      </w>
      <w id="3">
        <Value>...</Value>
      </w>
      <w id="4">
        <Value>...</Value>
      </w>
      <w id="5" hesi="..."/>
      <w id="6">
        <Value>...</Value>
      </w>
      <!-- the remainder of the words -->
    </s>
    <!-- the remainder of the sentences -->
  </sp>
  <!-- the remainder of the speakers -->
</Text>
4.4 Word analysis
To begin with, the characteristics of a word are used to assign it the corresponding tag. Indeed, words characterized as onomatopoeia, named entities or foreign words (French or English) are assigned respectively the tags Onomatopoeia, Named entity and Not-Recognized.
Then, we use two analyzers (MADA and the Al-Khalil TD version) and a dictionary, according to the rules defined below, to identify the grammatical class of each word. Two processing paths arise according to the language of each word: Tunisian dialect words and Modern Standard Arabic words. We detail these two paths in the following.
a) Tunisian Dialect Words
Figure 4.3: Analyzing process of a TD word
Analyzing a TD word follows the different steps described in Figure 4.3. To analyze a TD word, we start by looking it up in the TD dictionary, as described in Figure 4.4. If this process does not give any analysis, we analyze the word with the Al-Khalil TD version analyzer. If the word is recognized by neither the TD dictionary nor the Al-Khalil analyzer, we remove its diacritics and reanalyze it with the TD dictionary.
Analyzing a word without its diacritics allows us to solve the problem of the same word being written with different diacritics according to the dialect of the speaker.
If these processes do not lead to any analysis, we reanalyze the word without diacritics using the Al-Khalil TD version analyzer. Finally, if there is still no possible analysis, we analyze the word with the MADA analyzer; indeed, many words written without diacritics have the same form in MSA and in TD, so analyzing them with an MSA analyzer may yield a possible analysis.
During this process, if there is more than one analysis, we proceed to rank them. When our method finds no possible analysis for a given word, we assign it the tag unknown. Furthermore, our system allows the user to intervene by choosing the correct analysis, by updating an analysis, or by adding a new one. Figure 4.4 gives more details about the dictionary procedure; a minimal sketch of the whole fallback cascade is given below.
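As an illustration, here is a minimal Java sketch of this fallback cascade; each resource is wrapped as a function from a word to its list of analyses, and the wrapper names are assumptions made for the example, not the actual TDAT classes:

import java.util.List;
import java.util.function.Function;

public class TdWordAnalyzer {

    // Strips the Arabic diacritics (fathatan .. sukun, U+064B to U+0652).
    static String removeDiacritics(String word) {
        return word.replaceAll("[\\u064B-\\u0652]", "");
    }

    // Tries each resource in order and returns the first non-empty list of analyses;
    // an empty list at the end means the word gets the tag "unknown".
    static List<String> analyze(String word,
                                Function<String, List<String>> tdDictionary,
                                Function<String, List<String>> alKhalilTd,
                                Function<String, List<String>> mada) {
        List<String> result = tdDictionary.apply(word);          // 1. TD dictionary
        if (result.isEmpty()) {
            result = alKhalilTd.apply(word);                     // 2. Al-Khalil TD version
        }
        if (result.isEmpty()) {
            String bare = removeDiacritics(word);
            result = tdDictionary.apply(bare);                   // 3. dictionary, no diacritics
            if (result.isEmpty()) {
                result = alKhalilTd.apply(bare);                 // 4. Al-Khalil TD, no diacritics
            }
            if (result.isEmpty()) {
                result = mada.apply(bare);                       // 5. MADA as a last resort
            }
        }
        return result;
    }
}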
Analyzing with the TD dictionary:
Analyzing a word with the TD dictionary is handled by applying a morpheme segmentation method, as used in [Yang 2007]. Figure 4.4 shows the different steps taken to analyze a word with the TD dictionary; these steps are executed sequentially until we obtain an analysis result, and a minimal sketch of the lookup order is given after this paragraph.
First, we search the TD dictionary in this order: Conjunctions, Pronouns, Number words, Interjections, Particles, Adjectives, Adverbs, Nouns, Verbs. Second, we look sequentially in the Adverbs, Nouns and Verbs dictionaries while trying all possible TD prefixes. Third, we do the same with the suffixes. Finally, we repeat the procedure with prefixes and suffixes combined.
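A minimal Java sketch of this lookup order, assuming the dictionaries have been loaded as a map from category name to a set of entries; the affix handling and the return convention are simplifications made for the example:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DictionaryLookup {

    static final List<String> CATEGORY_ORDER = Arrays.asList(
            "Conjunction", "Pronoun", "NumberWord", "Interjection",
            "Particle", "Adjective", "Adverb", "Noun", "Verb");

    static final List<String> OPEN_CATEGORIES = Arrays.asList("Adverb", "Noun", "Verb");

    // Returns the category of the word (with affix information) or null if nothing matches.
    static String lookup(String word, Map<String, Set<String>> dictionaries,
                         List<String> prefixes, List<String> suffixes) {
        // 1. Plain lookup, following the fixed category order.
        for (String cat : CATEGORY_ORDER) {
            if (dictionaries.getOrDefault(cat, Set.of()).contains(word)) {
                return cat;
            }
        }
        // 2. Strip each possible prefix and retry on the open categories.
        for (String p : prefixes) {
            if (word.startsWith(p)) {
                String found = inOpenCategories(dictionaries, word.substring(p.length()));
                if (found != null) return found + " (prefix " + p + ")";
            }
        }
        // 3. Strip each possible suffix and retry.
        for (String s : suffixes) {
            if (word.endsWith(s)) {
                String found = inOpenCategories(dictionaries,
                        word.substring(0, word.length() - s.length()));
                if (found != null) return found + " (suffix " + s + ")";
            }
        }
        // 4. Strip a prefix and a suffix together and retry.
        for (String p : prefixes) {
            for (String s : suffixes) {
                if (word.startsWith(p) && word.endsWith(s)
                        && word.length() > p.length() + s.length()) {
                    String found = inOpenCategories(dictionaries,
                            word.substring(p.length(), word.length() - s.length()));
                    if (found != null) return found + " (prefix " + p + ", suffix " + s + ")";
                }
            }
        }
        return null;
    }

    private static String inOpenCategories(Map<String, Set<String>> dictionaries, String stem) {
        for (String cat : OPEN_CATEGORIES) {
            if (dictionaries.getOrDefault(cat, Set.of()).contains(stem)) {
                return cat;
            }
        }
        return null;
    }
}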
Analysis ranking:
We rank and confirm each word analysis according to the following order: TD dictionary, Al-Khalil TD version analyzer, MADA analyzer. Furthermore, if these tools and resources give the same analysis, we keep only one of them.
Figure 4.4: Analyzing a word with TD dictionary
b) Modern Standard Arabic Words
We use two analyzers, MADA and Al-Khalil, to analyze MSA words. First we use MADA. Then, if there is no possible analysis, we analyze the word with the Al-Khalil analyzer. Moreover, if there is still no possible analysis, we remove the diacritics and reanalyze it with Al-Khalil.
4.5 Choosing results
When using the adapted Al-Khalil analyzer and the dictionary, a problem of having several analyses for the same word can appear. This problem is caused by differences in diacritic writing or by ambiguity.
Problem with the Al-Khalil analyses:
Usually, the adapted Al-Khalil analyzer returns a list of analyses for a given word with different pieces of information (prefix, suffix, gender, number, person, voice, etc.). In general, this problem is related to the ambiguity of the Arabic language.
Problem while using the dictionary:
The problem with the dictionary analyses is related to the diacritic writing of words. To solve it, we rank these results by comparing their distance to the original word. Indeed, we use the Levenshtein distance [Haldar 2011] to measure the difference between two sequences or words.
Mathematically, the Levenshtein distance between two strings a and b is given by lev_{a,b}(|a|, |b|), where:

lev_{a,b}(i, j) =
  0,  if i = j = 0
  i,  if j = 0 and i > 0
  j,  if i = 0 and j > 0
  min( lev_{a,b}(i-1, j) + 1,  lev_{a,b}(i, j-1) + 1,  lev_{a,b}(i-1, j-1) + 1_{(a_i ≠ b_j)} ),  otherwise

• lev_{a,b}(i-1, j) + 1: the minimum corresponds to a deletion (from a to b).
• lev_{a,b}(i, j-1) + 1: the minimum corresponds to an insertion (from a to b).
• lev_{a,b}(i-1, j-1) + 1_{(a_i ≠ b_j)}: the minimum corresponds to a match or a mismatch (from a to b), depending on whether the respective symbols are the same.
Example: for an original word from the corpus and the word returned by the analysis, the Levenshtein distance between the two words is 2, as calculated in Table 4.3. A minimal sketch of the computation follows.
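For reference, a standard iterative dynamic-programming implementation of the Levenshtein distance in Java (an illustration of the measure, not the code of the tool itself):

public class Levenshtein {

    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // i deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // j insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,       // deletion
                                 d[i][j - 1] + 1),      // insertion
                        d[i - 1][j - 1] + cost);        // match or mismatch
            }
        }
        return d[a.length()][b.length()];
    }
}

Candidate analyses can then be sorted by increasing distance to the original word, so that the closest spelling is ranked first.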
When we have analyses from both the Al-Khalil TD version analyzer and the TD dictionary (the case where we have a problem with diacritic writing), we use the dictionary analysis to confirm the Al-Khalil analysis if they have the same grammatical function.
Finally, we generate the annotation file as described in the following section.
4.6 Result file generation
The tags used while generating the annotation file are presented in Table 4.4; an illustrative sketch of a resulting word entry is given after the table.
Table 4.4: Description of tags used in the annotation result file
- asp (aspect): the aspect (order or request, perfective, imperfective)
- vox (voice): the voice (active, passive, etc.)
- stt (state): the state (indefinite, definite, construct, etc.)
- per (person): the person (1st, 2nd, 3rd)
- num (number): the number (singular, dual, plural)
- gen (gender): the gender (feminine, masculine)
- case (case): the case (nominative, accusative, genitive)
- suffix (suffix): the suffix of the word
- pattern (pattern): the pattern of the word
- root (root): the root of the word
- stem (stem): the stem of the word
- spee (part of speech): the part of speech of the word
- prefix (prefix): the prefix of the word
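A purely illustrative sketch of how one annotated word entry could be written with the JDOM library mentioned in Chapter 5, carrying a subset of the Table 4.4 tags as attributes; the element and attribute layout is an assumption made for the example, and the actual structure of the TDAT result file may differ:

import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;

public class ResultFileSketch {
    public static void main(String[] args) throws Exception {
        // One annotated word with a subset of the Table 4.4 information.
        Element word = new Element("w").setAttribute("id", "wID10");
        word.setAttribute("spee", "Verb")        // part of speech
            .setAttribute("asp", "perfective")   // aspect
            .setAttribute("per", "3rd")          // person
            .setAttribute("num", "singular")     // number
            .setAttribute("gen", "masculine")    // gender
            .setAttribute("prefix", "")
            .setAttribute("suffix", "");
        Element text = new Element("Text").setAttribute("id", "012").addContent(word);
        new XMLOutputter(Format.getPrettyFormat()).output(new Document(text), System.out);
    }
}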
4.7 Conclusion
Many problems complicated the integration of the tools we used. They were mainly due to the different input/output formats of the tools and to the granularity of their tag sets. In addition, other problems appeared while using the analysis tools. For example, the MADA analyzer ignores word characteristics such as the language of the word, which can affect the analysis result. Another problem occurs when the Al-Khalil analyzer returns many analyses, so the user has to intervene to choose one of them.
Table 4.1: Annotations extracted from the transcription files
- Pronunciation notation: elongation, liaisons, elisions, incomprehensible sequence
- Disfluencies: incomplete word, onomatopoeia
- Named entities: person name, place name
- Word language: TD, MSA, French, English
Table 4.3: Levenshtein distance table example (the dynamic-programming matrix computed between the original word and the word returned by the analysis; the bottom-right cell gives the distance, here 2)
Chapter 5
Realization and Performance Evaluation
5.1 Introduction
The expansion of NLP applications for dialectal Arabic requires a large amount of resources in terms of data and tools. By developing a morpho-syntactic annotation tool for the TD language, we facilitate the morpho-syntactic annotation task, which in turn supports the construction of corpora.
In this chapter, we introduce the tools and resources used in our system. Then, we present our TD annotation tool by explaining its different modules and functionality, and by providing some details about the development environment. Finally, we experiment with our tool and discuss the results obtained in the different assessments.
5.2 Tools and Resources
The morpho-syntactic annotation process contains several tasks that can be handled using existing tools and resources. In this section, we introduce the analyzers and the dictionary we used.
5.2.1 Al-Khalil analyzer
The Al-Khalil analyzer was developed to produce tags for a given text by performing a morphological analysis of the text. Its lexical resource consists of several classes that handle both vowelled and unvocalized words. The main process is based on using patterns for both verbal and nominal words, Arabic word roots, and affixes.
Indeed, according to [Altabba 2010], Al-Khalil is still the best morphological analyzer for Arabic. In addition, Al-Khalil won the first prize at a competition organized by the Arab League Educational, Cultural and Scientific Organization (ALECSO) in 2010.
5.2.2 MADA analyzer
The MADA+TOKAN toolkit is a morphological analyzer introduced by [Habash 2009] and used to derive extensive morphological and contextual information from Arabic text. The toolkit covers many tasks: high-accuracy part-of-speech tagging, diacritization, lemmatization, disambiguation, stemming, and glossing. In addition, MADA ranks the analysis results and outputs the most suitable analysis for the current context of each word.
The analysis results carry complete diacritic, lexemic, gloss and morphological information. TOKAN then takes the information provided by MADA to generate tokenized output in a wide variety of customizable formats, which allows easier extraction and manipulation. MADA achieves an accuracy of 86% in predicting full diacritization and 96% on basic morphological choice and on lemmatization.
5.2.3 Al-Khalil TD version
Recent research in our ANLP group [Amouri 2013] studied the dialect by adapting the Al-Khalil analyzer to the TD language. Thanks to the enrichment of the transformation rules, the adapted analyzer achieves scores of 81.17% recall and 96.64% accuracy for correctly analyzed verbs.
5.2.4 Tunisian Dialect Dictionary
The TD dictionary [Ayed 2013, Boujelbane 2013] was constructed using the lexical units of the Arabic Treebank corpus and their parts of speech, by converting words from MSA to the TD language. The result is an XML lexical database composed of nine dictionaries (Conjunctions, Pronouns, Number words, Interjections, Particles, Adjectives, Adverbs, Nouns, Verbs).
5.3 Tunisian Dialect Annotation Tool
This section presents our TD annotation tool (TDAT). The first part clarifies the usefulness of our system and its functionality. The second part gives details on its characteristics and the development environment.
5.3.1 Process of TDAT
In order to specify and visualize the artifacts of our system, we detail its functionalities and its manipulation procedure. We also introduce the structure of our system.
a- System functionalities
The principal functionality of our system is to generate a morpho-syntactic annotation for each word of the transcription file. This process is composed of two basic steps. The first is to segment the transcription file; the generated segmented file follows a unique XML structure which allows a better representation of the text and of the speech phenomena. The second step takes a segmented text file as input and then analyses each word by determining its suitable grammatical class.
Our annotation system allows easy manipulation of the obtained analysis results. The user can show, update and save an annotation file, and open an unfinished annotated file to complete it. Additionally, the user can select, among the available dictionary and morphological analyzers, which ones will be used in the morpho-syntactic annotation process.
• Segment a Transcription File:
The aim of the segmenting script is to prepare the transcription file as input to our TDAT. Indeed, this tool allows the TDAT to support different transcription file types; currently, it supports two transcription file formats (trs and TextGrid).
The generated file (Segmented Text) follows a unique XML structure that gives the user a better representation of the speech phenomena. The structure was created in accordance with the TEI recommendations. Furthermore, the segmented text allows our system to interpret the orthographic transcription more easily. The segmenting tool also generates three other files:
- Words' list: contains the list of all words and their frequencies.
- Sentences' list: contains the list of all sentences.
- Statistics: contains some useful statistics about the content of the transcription file (see Figure 5.1 for more details).
<?xml version='1.0' encoding='UTF-8'?>
<STATISTIC id="110">
  <Speakers>3</Speakers>
  <Sentence>267</Sentence>
  <Words>3231</Words>
  <Hezitation>117</Hezitation>
  <Onomatopoeia>150</Onomatopoeia>
  <Elisions>102</Elisions>
  <NamedEntity>54</NamedEntity>
</STATISTIC>
Figure 5.1: XML Schema of the Annotated Text
In order to segment a transcription file (trs or TextGrid), the user has to open or select a transcription file in the corpus file tree (see (2) in Figure 5.3 for more details). Then, our tool starts the process by opening the transcription file; an error message appears if a problem occurs during this process, such as an unexpected file format.
Throughout this process, the system informs the user about the progress (see (4) in Figure 5.3). Then, our tool loads the generated segmented text file and asks whether to show the obtained result. Finally, our system loads and shows in the statistics menu the statistical information (number of words, sentences, speakers, hesitations, onomatopoeia, elisions and named entities) relative to the segmented file (see (2) in Figure 5.4 for more details). The user is notified if there is a problem while loading the segmented file or the statistics file.
• List of Word Frequencies:
To show the list of word frequencies, an already segmented file must be opened or selected in the segmented file list. Our system then loads the word frequency list relative to the selected segmented file. If there is a problem while loading it, an error message informs the user.
• Annotation options:
To analyse a word, our system uses analyzers and a dictionary. These choices can be updated before annotating a file by selecting, among the available analyzers and dictionary, which ones to use. In addition, the path of the used resources can easily be updated (see Figure 5.8). These options take effect, after saving, when a new annotation process is launched.
• Segmented File Annotation:
The annotation process starts when the user wants to annotate a transcribed file. The user can either segment an opened transcription file from the segmented file list or open an incomplete annotated file. Then, given the selected resource options (dictionary and analyzers), the developed system launches the analysis process.
During this process, the analysis results appear progressively in the annotation window (see (1) in Figure 5.6 for more details) and another window appears to inform the user about the progress (see (2) in Figure 5.6). In addition, our system gives the user the possibility to show a recap of the analyzed words during the annotation process (see Figure 5.7 for more details). If a problem occurs while analyzing a word or executing one of the morphological analyzer tools, an error message appears in the console. The user can also stop the whole current process at any time (see (2) in Figure 5.6).
When there is more than one analysis for a given word, i.e. a case of ambiguity, the user has to intervene to select and confirm the right analysis for the word (see Figure 5.5). After confirming all the analyses, the user can save the annotation file; otherwise, the system saves the incomplete file, conserving all the analyses of each word that has not been confirmed. To save an annotation file, the user has to select the file format and the result directory path. Finally, a message informs the user that the annotation file has been successfully saved and a new window containing the obtained result appears (see Figure 5.9). Otherwise, a message informs the user that a problem occurred while saving the annotation file.
• Update Analysis Results:
The user has the possibility to update the analysis results during or after the analysis. To update an analysis result, the user selects the appropriate grammatical class of a given word from the analysis list returned by our system. In addition, the user can add a new analysis by typing the additional information such as the prefix and suffix (see Figure 5.5).
After confirming the new analysis, the system updates the annotation window by adding the new tag at the top of the analysis list of the relevant word. The newly added analysis is considered the best analysis, so there is no need to confirm it later.
b- System Collaborations:
Our system collaborates with other tools to generate the most suitable grammatical class for each word. Indeed, our system interacts with:
• the MADA analyzer, to analyze a text;
• the Al-Khalil analyzer, to analyze a word;
• the Al-Khalil TD analyzer, to analyze a word;
• a Perl script, to segment a transcription file.
5.3.2 Realization
We chose the Java programming language to develop our system for several reasons:
• Java is one of the most popular programming languages in use¹ (as of 2012), thanks to its simplicity.
• It is platform-independent at both the source and binary levels.
• It allows creating modular programs, which makes it easy to reuse predefined structures of other projects, in particular the Al-Khalil source code.
Furthermore, we used multi-threading to perform several tasks simultaneously, especially during the annotation process. By using this technique, we first increased the processing speed. Second, we allowed a direct display of the results, which lets the user intervene to confirm the returned analyses instead of waiting for the end of the whole annotation process; a minimal sketch of this pattern is given below.
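A minimal sketch of this pattern, assuming a Swing-based interface; the class is illustrative, not the actual TDAT code. The annotation runs in a background thread and publishes every analyzed word as soon as it is ready, so the interface can display it and let the user confirm it immediately:

import java.util.List;
import javax.swing.SwingWorker;

public class AnnotationTask extends SwingWorker<Void, String> {

    private final List<String> words;

    public AnnotationTask(List<String> words) {
        this.words = words;
    }

    @Override
    protected Void doInBackground() {
        for (String word : words) {
            if (isCancelled()) break;           // the user can stop the process at any time
            String analysis = analyze(word);    // placeholder for the real analysis cascade
            publish(word + " -> " + analysis);  // hand the result over to the Swing thread
        }
        return null;
    }

    @Override
    protected void process(List<String> chunks) {
        // Runs on the Swing event-dispatch thread: update the annotation window here.
        chunks.forEach(System.out::println);
    }

    private String analyze(String word) {
        return "...";
    }
}

The task is started with new AnnotationTask(words).execute() and can be interrupted with cancel(true).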
We selected the Eclipse development environment because it allows us to program in several languages at once, in particular Perl and Java. Besides its extensibility in terms of programming languages, this multi-platform environment was already used in the development of the Al-Khalil analyzer. We therefore kept the same project characteristics, such as the text file encoding (Cp1256).
¹ http://en.wikipedia.org/wiki/Java_(programming_language)
In addition, we chose to work in a Linux environment to take advantage of its speed. This also allows better performance of the MADA analyzer, which basically works in this environment; indeed, MADA is built entirely in the Perl programming language.
To manipulate the segmented file and the annotation file, we use the XPath expression language. XPath is based on a tree representation of the XML document and provides the ability to navigate the tree by selecting nodes with a variety of criteria. XPath was defined by the World Wide Web Consortium (W3C), and its use in our environment requires the JDOM package; we therefore imported the JDOM library (version 2.0.0) into our project. A minimal sketch of this usage follows.
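A minimal sketch of this usage, assuming a segmented file named segmented.xml structured as in Figure 4.2 (illustrative code, not the exact TDAT implementation):

import java.io.File;
import java.util.List;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.filter.Filters;
import org.jdom2.input.SAXBuilder;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;

public class SegmentedFileReader {
    public static void main(String[] args) throws Exception {
        // Load the segmented file as a JDOM tree.
        Document doc = new SAXBuilder().build(new File("segmented.xml"));
        // Select every word element with an XPath expression.
        XPathExpression<Element> allWords =
                XPathFactory.instance().compile("//w", Filters.element());
        List<Element> words = allWords.evaluate(doc);
        for (Element w : words) {
            System.out.println(w.getAttributeValue("id") + " : " + w.getChildText("Value"));
        }
    }
}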
Thanks to the use of the Perl programming language, processing even a large file takes less than one second. Perl provides many predefined functions that allow easy manipulation and generation of text files, and it is considered a leader in the field of text file processing. Updating the Perl code is also quite easy and avoids updating the whole application.
Furthermore, Perl is a multi-platform language and is usually pre-installed on Linux systems. Since we already use the MADA analyzer, relying on this language fits our TDAT without any extra requirement. We also added the EPIC plug-in to the Eclipse development environment in order to edit the Perl scripts.
5.3.3 TDAT Interfaces
Main interface
The main interface is divided into four parts.
Main menu:
The purpose of the main menu ((1) in Figure 5.2) is to provide easy access to all the files used by the application. The menu is organized according to the format of the files:
• File: file management.
• Transcription: management of the transcription files.
• Segmented Text: management of the segmented text files.
• Annotation: management of the annotation files.
Figure 5.2: Main Interface in the TDAT
Speed access menu:
The main purpose of the speed access menu ((2) in Figure 5.2) is to give easy access to the basic functions of the application; that is why we grouped these functions in four themes:
1. Corpus
2. Transcription
3. Annotation
4. Statistics
Transcription interface
The content of a transcription file can be visualized (see (1) in Figure 5.3) through the corpus tree (see (2) in Figure 5.3). The corpus tree contains all the transcription files of the corpus.
Figure 5.3: Transcription window in the TDAT
Segmented text interface
To start segmenting a transcription file (see (3) in Figure 5.3), the user has to choose a file from the corpus tree. The segmentation process is presented as in Figure 5.3. When the process is finished, a new window containing the segmented text appears (see (1) in Figure 5.4). In addition, the statistics menu (see (2) in Figure 5.4) shows the content of the corresponding statistics file.
Add analysis interface
The user can update the analysis results by selecting the right grammatical class for each word (see Figure 5.6). Furthermore, the user can add a new analysis for a given word by choosing the grammatical class and entering its prefix and suffix in the Add analysis interface (see Figure 5.5).
Analysis interface
The analyses in Figure 5.6 appear progressively, and the status icon changes according to the progress of the analysis. Another window shows the progress of the selected options (see (2) in Figure 5.6).
Figure 5.4: Segmented Text window in the TDAT
Figure 5.5: Add an analysis result window in the TDAT
Table 5.1 describes the meaning of each icon:
Table 5.1: Description of icons used in the annotation interface
Figure 5.6: Analyze window in the TDAT
Icon Description
• No possible analysis
• One analysis
• Many analyses
• Add an analysis
• Confirm an analysis
• Analysis given by the MADA analyzer
• Analysis given by the Al-Khalil analyzer
• Analysis given by the TD dictionary
• An analysis proposed by the user
• An analysis extracted from the transcription
Analysis details interface
Figure 5.7: Analysis Details window in the TDAT
The statistics button in the annotation interface (see (3) in Figure 5.6) gives the user details about the recognized words (see Figure 5.7). These statistics are updated automatically during the analysis process.
Annotation options interface
The annotation options window is shown in Figure 5.8.
Analysis result file interface
To save the analysis, the user has to choose the file format and to enter the result file name and path. The generated annotation file is shown in Figure 5.9.
Figure 5.8: Annotation options window in the TDAT
5.4 Evaluation
Evaluating a morpho-syntactic annotation system allows us to determine its capabilities and to diagnose its strengths and weaknesses; the evaluation process therefore requires a lot of objectivity.
The best-known evaluation method is to compare the performance of the developed system with other similar systems. Such a system must have the same input and output. The capability of analyzing a word with its diacritics can also be a decisive factor in the evaluation.
Another method used in the state of the art to evaluate a morphological analyzer is to compare the analysis results with a gold standard. However, there is no gold-standard morpho-syntactic annotation available for the TD language yet. Thus, we developed a gold standard for evaluating our tool, as described below. Then, we evaluate three basic modules of the developed system. Finally, we summarize the strengths and weaknesses of the developed TDAT.
Figure 5.9: Analysis Result File window in the TDAT
5.4.1 Gold standard for the TD language
In order to evaluate our system, we developed a gold standard for the TD language composed of two annotated transcription files (2,409 words). These morpho-syntactic annotations were created manually by an expert in linguistics. The annotation tags used are the same as in the TD dictionary (Conjunction, Pronoun, Number word, Interjection, Particle, Adjective, Adverb, Noun, Verb). We also included the suffix and prefix as additional information.
5.4.2 Evaluation of the TDAT
We chose to evaluate three modules of our system.
Evaluation of the word segmenter:
One of the basic tasks when analyzing a word with the dictionary is to determine its prefix and suffix. Indeed, a word may carry a suffix and a prefix, so the system has to decompose it using our word segmenter module. In order to evaluate this module, we first analyzed the words without segmenting them; then we used the segmenter to identify the possible prefix and suffix of each word not recognized by our system.
Example:
One word of the test corpus could not be recognized when analyzed directly with the dictionary. However, by using the word segmenter module inside the dictionary analyzer, our system identifies its suffix and annotates the word as a verb.
The evaluation results in terms of accuracy are detailed in Table 5.2. The accuracy is defined as:
Accuracy = (TP + TN) / (TP + TN + F) = T / AW
where:
• TP: words recognized and correctly analyzed
• TN: words recognized but not correctly analyzed
• F: words not recognized
• T: all recognized words
• AW: all analyzed words
Table 5.2: Evaluation results of the word segmenter module
Before using the segmenter module: 239 recognized, 826 not recognized, accuracy 22.44%
After using the segmenter module: 410 recognized, 655 not recognized, accuracy 38.49%
Indeed, by integrating the segmenter module, we achieve an improvement of 16.05 points of accuracy for the dictionary module (410 instead of 239 recognized words out of the 1,065 analyzed words).
Evaluation of the Levenshtein distance:
Analyzing a word with the TD dictionary can produce several analyses, which is primarily due to differences in diacritic writing. After testing our system on a transcription text composed of 1,065 words, we found that 47% of the analyses obtained from the dictionary module (81 cases) are ambiguous: multiple choices appear for each such word, owing to the use of the segmenter module. To rank these analyses, we use the Levenshtein distance.
Example: for a word whose dictionary lookup returns two candidate analyses, applying the Levenshtein distance reorders them so that the candidate with the shortest distance to the original word comes first, and that candidate is the one the system selects.
In order to evaluate the usefulness of the Levenshtein distance function, we evaluate the precision of the TD dictionary module. The precision is defined as:
Precision = TruePositives / TestOutcomePositives = CorrectlyAnalyzedWords / AllRecognizedWords
As our system considers the first analysis in the list as the best analysis of a word, ranking the candidate analyses with the Levenshtein distance improves the system performance. Table 5.3 gives the evaluation results of the TD dictionary module before and after introducing the Levenshtein distance procedure.
Table 5.3: Evaluation results of the TD dictionary module using the Levenshtein distance function
Before using the Levenshtein distance: 52 correctly analyzed, 29 not correctly analyzed, precision 64.19%
After using the Levenshtein distance: 77 correctly analyzed, 4 not correctly analyzed, precision 95.06%
We notice that in some cases ranking the candidates by their suffix and diacritic distance does not solve the ambiguity problem. The integration of a classification module based on the sentence context of the word could solve it.
Evaluation of the analysis results:
The main function of our tool is to deliver as output a morpho-syntactic annotation for each word of the input. In order to evaluate this module, we used our gold standard as a test corpus.
The evaluation results of the TDAT, when analyzing with all resource options enabled, are detailed in Table 5.4.
Table 5.4: Evaluation results of the TDAT
Recognized and correctly analyzed: 1803
Recognized but incorrectly analyzed: 355
Not recognized: 251
We obtain a precision of 83.54%. The incorrect analyses are mostly adjectives interpreted as verbs; this problem appears when analyzing complex TD words with the Al-Khalil analyzer, whose patterns give an incorrect interpretation once the word suffixes are removed.
When using the MADA analyzer, some words are incorrectly tagged as nouns, which is caused by the difference in meaning of these words between the TD and MSA languages.
The analysis errors when using the TD dictionary module are due to diacritic differences, a problem caused by the variation in spoken TD.
We used the F-measure to study the quality of the analysis results. The F-measure is defined as:
F-score1 = 2·TP / (2·TP + F)
where:
• TP: correctly analyzed words
• F: words not recognized
We obtained an F-score1 of 91.03%, which is a promising result compared to the existing tools for the TD language; a small sketch of the three evaluation measures used in this section is given below.
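For reference, a small Java sketch gathering the three measures exactly as they are defined in this chapter (illustrative code, independent of the TDAT implementation):

public class EvaluationMeasures {

    // Accuracy = T / AW: all recognized words over all analyzed words.
    static double accuracy(int recognized, int notRecognized) {
        return (double) recognized / (recognized + notRecognized);
    }

    // Precision: correctly analyzed words over all recognized words.
    static double precision(int correct, int incorrect) {
        return (double) correct / (correct + incorrect);
    }

    // F-score1 = 2*TP / (2*TP + F), with TP the correct analyses and F the unrecognized words.
    static double fScore1(int correct, int notRecognized) {
        return 2.0 * correct / (2.0 * correct + notRecognized);
    }

    public static void main(String[] args) {
        // Word segmenter before segmentation (Table 5.2): 239 recognized vs 826 not recognized.
        System.out.printf("Accuracy: %.2f%%%n", 100 * accuracy(239, 826));   // 22.44%
        // TD dictionary with Levenshtein ranking (Table 5.3): 77 correct vs 4 incorrect.
        System.out.printf("Precision: %.2f%%%n", 100 * precision(77, 4));    // 95.06%
    }
}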
5.5 Conclusion
In this chapter we presented the TDAT, which was developed to handle the morpho-syntactic annotation task. In order to allow easier use and extension of our project, we used free software that supports multiple platforms. The TDAT also uses different resource options, which leads to a detailed analysis. Despite requiring more analysis time than the other tools used, the MADA analyzer remains very useful for transcripts that contain formal discussion, such as TV dialogue.
The results obtained when using all resource options are very promising: we achieved an F-score1 of 91.03% on our test corpus. In addition, the developed tool could be improved by ranking the analysis results. Also, enriching the TD dictionary could lead to better results, especially for nouns.
Conclusion
The tools for dialectal Arabic are few and often miss certain features or do not reach the same standard as their MSA counterparts. There is thus a need for resources and tools for the Arabic dialects in order to start creating new and better NLP applications. By developing the TDAT, we aimed to provide a tool that accepts different types of transcription formats and produces morpho-syntactic annotations for TD words.
In order to build a morpho-syntactically annotated corpus for the TD language, we started by collecting speech data. Then, we transcribed the collected data following our orthographic transcription guidelines and using two transcription tools (Transcriber and Praat). Finally, we developed a tool that takes the elaborated transcription files as input and produces a morpho-syntactically annotated file as output. To handle this task, our tool uses a TD dictionary and two analyzers (the Al-Khalil TD analyzer and the MADA analyzer); the selected resource options can easily be updated.
During the transcription process we created a corpus of more than 1 hour and 25 minutes of speech. A portion of the developed corpus was used to train the developed system.
In order to determine the capability of our TDAT tool to analyse TD text, we constructed a test gold standard for the TD language and used a portion of this corpus to test the different modules of our tool. The evaluation results show that the dictionary analysis, using the segmenter module and the Levenshtein ranking, reaches a precision of 95.06%. However, given the obtained accuracy score, the dictionary analysis still needs improvement through the enrichment of the dictionary, especially the noun dictionary. Thanks to the use of Al-Khalil TD and the other resource options, our tool attains an F-score1 of 91.03%.
The developed corpus could be enlarged by integrating other topics. Furthermore, our corpus covers different subjects and can be used to train linguistic analysis models, in automatic speech processing, or in any other area of natural language processing.
The analysis results obtained by our tool could be improved by: enlarging the TD lexical database; using a classification module, based for example on statistics, to rank the analysis results; and updating the Al-Khalil patterns and database by studying the new ambiguous cases encountered during analysis. The input of our system could also be extended to support other speech text formats, such as web pages, since the use of dialectal language is increasing on social networks.
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis
Master Thesis

More Related Content

What's hot

Application mobile bancaire sous la plateforme Android
Application mobile bancaire sous la plateforme AndroidApplication mobile bancaire sous la plateforme Android
Application mobile bancaire sous la plateforme AndroidKhaled Fayala
 
Rapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDI
Rapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDIRapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDI
Rapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDIMohammed Boussardi
 
Rapport PFE MeetASAP
Rapport PFE MeetASAP Rapport PFE MeetASAP
Rapport PFE MeetASAP Aroua Jouini
 
Rapport PFE Développent d'une application bancaire mobile
Rapport PFE Développent d'une application bancaire mobileRapport PFE Développent d'une application bancaire mobile
Rapport PFE Développent d'une application bancaire mobileNader Somrani
 
Rapport de projet de fin d"études
Rapport de projet de fin d"étudesRapport de projet de fin d"études
Rapport de projet de fin d"étudesMohamed Boubaya
 
Rapport de projet de fin d’étude
Rapport  de projet de fin d’étudeRapport  de projet de fin d’étude
Rapport de projet de fin d’étudeOumaimaOuedherfi
 
Rapport PFE: PIM (Product Information Management) - A graduation project repo...
Rapport PFE: PIM (Product Information Management) - A graduation project repo...Rapport PFE: PIM (Product Information Management) - A graduation project repo...
Rapport PFE: PIM (Product Information Management) - A graduation project repo...younes elmorabit
 
La business Intelligence Agile
La business Intelligence AgileLa business Intelligence Agile
La business Intelligence Agiledihiaselma
 
Projet de fin étude ( LFIG : Conception et Développement d'une application W...
Projet de fin étude  ( LFIG : Conception et Développement d'une application W...Projet de fin étude  ( LFIG : Conception et Développement d'une application W...
Projet de fin étude ( LFIG : Conception et Développement d'une application W...Ramzi Noumairi
 
Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...
Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...
Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...luanvantrust
 
Rapport pfe talan_2018_donia_hammami
Rapport pfe talan_2018_donia_hammamiRapport pfe talan_2018_donia_hammami
Rapport pfe talan_2018_donia_hammamiDonia Hammami
 
Rapport de stage d'été
Rapport de stage d'étéRapport de stage d'été
Rapport de stage d'étéJinenAbdelhak
 

What's hot (20)

Application mobile bancaire sous la plateforme Android
Application mobile bancaire sous la plateforme AndroidApplication mobile bancaire sous la plateforme Android
Application mobile bancaire sous la plateforme Android
 
Belwafi bilel
Belwafi bilelBelwafi bilel
Belwafi bilel
 
Rapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDI
Rapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDIRapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDI
Rapport PFE Lung Cancer Detection - MOHAMMED BOUSSARDI
 
Rapport PFE MeetASAP
Rapport PFE MeetASAP Rapport PFE MeetASAP
Rapport PFE MeetASAP
 
Rapport PFE Développent d'une application bancaire mobile
Rapport PFE Développent d'une application bancaire mobileRapport PFE Développent d'une application bancaire mobile
Rapport PFE Développent d'une application bancaire mobile
 
Rapport de stage
Rapport de stageRapport de stage
Rapport de stage
 
Đề tài hiệu quả quản trị rủi ro tỉ giá, ĐIỂM 8, HOT
Đề tài  hiệu quả quản trị rủi ro tỉ giá, ĐIỂM 8, HOTĐề tài  hiệu quả quản trị rủi ro tỉ giá, ĐIỂM 8, HOT
Đề tài hiệu quả quản trị rủi ro tỉ giá, ĐIỂM 8, HOT
 
Rapport de projet de fin d"études
Rapport de projet de fin d"étudesRapport de projet de fin d"études
Rapport de projet de fin d"études
 
Rapport de projet de fin d’étude
Rapport  de projet de fin d’étudeRapport  de projet de fin d’étude
Rapport de projet de fin d’étude
 
Rapport PFE: PIM (Product Information Management) - A graduation project repo...
Rapport PFE: PIM (Product Information Management) - A graduation project repo...Rapport PFE: PIM (Product Information Management) - A graduation project repo...
Rapport PFE: PIM (Product Information Management) - A graduation project repo...
 
La business Intelligence Agile
La business Intelligence AgileLa business Intelligence Agile
La business Intelligence Agile
 
Presentation PFE
Presentation PFEPresentation PFE
Presentation PFE
 
Projet de fin étude ( LFIG : Conception et Développement d'une application W...
Projet de fin étude  ( LFIG : Conception et Développement d'une application W...Projet de fin étude  ( LFIG : Conception et Développement d'une application W...
Projet de fin étude ( LFIG : Conception et Développement d'une application W...
 
Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...
Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...
Hoạt động thanh toán quốc tế tại Ngân hàng Thương mại cổ phần xăng dầu Petrol...
 
Đề tài: Phân tích tình hình tài chính của Ngân hàng Agribank, 9đ
Đề tài: Phân tích tình hình tài chính của Ngân hàng Agribank, 9đĐề tài: Phân tích tình hình tài chính của Ngân hàng Agribank, 9đ
Đề tài: Phân tích tình hình tài chính của Ngân hàng Agribank, 9đ
 
Rapport pfe talan_2018_donia_hammami
Rapport pfe talan_2018_donia_hammamiRapport pfe talan_2018_donia_hammami
Rapport pfe talan_2018_donia_hammami
 
Rapport de stage d'été
Rapport de stage d'étéRapport de stage d'été
Rapport de stage d'été
 
Luận văn: Hoàn thiện công tác huy động vốn dân cư tại ngân hàng, HAY!
Luận văn: Hoàn thiện công tác huy động vốn dân cư tại ngân hàng, HAY!Luận văn: Hoàn thiện công tác huy động vốn dân cư tại ngân hàng, HAY!
Luận văn: Hoàn thiện công tác huy động vốn dân cư tại ngân hàng, HAY!
 
Đề tài: Hoạch định chiến lược tại công ty cổ phần cảng Nam Hải
Đề tài: Hoạch định chiến lược tại công ty cổ phần cảng Nam HảiĐề tài: Hoạch định chiến lược tại công ty cổ phần cảng Nam Hải
Đề tài: Hoạch định chiến lược tại công ty cổ phần cảng Nam Hải
 
Đề tài: Bảng cân đối kế toán tại Công ty sản xuất sắt thép, HAY
Đề tài: Bảng cân đối kế toán tại Công ty sản xuất sắt thép, HAYĐề tài: Bảng cân đối kế toán tại Công ty sản xuất sắt thép, HAY
Đề tài: Bảng cân đối kế toán tại Công ty sản xuất sắt thép, HAY
 

Viewers also liked

Stepping Up To The Next Level 2010
Stepping Up To The Next Level 2010Stepping Up To The Next Level 2010
Stepping Up To The Next Level 2010tscheschlok
 
Actividad 2 cosme_diego
Actividad 2 cosme_diegoActividad 2 cosme_diego
Actividad 2 cosme_diegoDiego cosme
 
Hsse komitmen latipah fahrun
Hsse komitmen latipah fahrunHsse komitmen latipah fahrun
Hsse komitmen latipah fahrunArya Sejahtera
 
2039884359 apresentação aula-01
2039884359 apresentação aula-012039884359 apresentação aula-01
2039884359 apresentação aula-01Naiara Gomes
 
Teaching Algoritms Using Visual Basic (Hungarian)
Teaching Algoritms Using Visual Basic (Hungarian)Teaching Algoritms Using Visual Basic (Hungarian)
Teaching Algoritms Using Visual Basic (Hungarian)Beregszászi István
 
Produk 2015 Terasik
Produk 2015 TerasikProduk 2015 Terasik
Produk 2015 Terasikipan3rut
 
Mobile Security Blanco/Ueda
Mobile Security Blanco/UedaMobile Security Blanco/Ueda
Mobile Security Blanco/UedaFernando Blanco
 
Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...
Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...
Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...iosrjce
 
Chapter 14 lesson 2
Chapter 14 lesson 2Chapter 14 lesson 2
Chapter 14 lesson 2rmckinnon1
 
To be , have got and can
To be , have got and canTo be , have got and can
To be , have got and canVanina1234
 
Wh leaping frogs game
Wh   leaping frogs gameWh   leaping frogs game
Wh leaping frogs gameVanina1234
 
60314 comparatives and_superlatives solar system
60314 comparatives and_superlatives solar system60314 comparatives and_superlatives solar system
60314 comparatives and_superlatives solar systemVanina1234
 
معرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژاد
معرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژادمعرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژاد
معرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژادirpycon
 
Nuevas tecnologías
Nuevas tecnologíasNuevas tecnologías
Nuevas tecnologíasMilla9305
 
Detailed Lesson Plan (5A's)
Detailed Lesson Plan (5A's)Detailed Lesson Plan (5A's)
Detailed Lesson Plan (5A's)EMT
 

Viewers also liked (20)

Stepping Up To The Next Level 2010
Stepping Up To The Next Level 2010Stepping Up To The Next Level 2010
Stepping Up To The Next Level 2010
 
Actividad 2 cosme_diego
Actividad 2 cosme_diegoActividad 2 cosme_diego
Actividad 2 cosme_diego
 
Hsse komitmen latipah fahrun
Hsse komitmen latipah fahrunHsse komitmen latipah fahrun
Hsse komitmen latipah fahrun
 
2039884359 apresentação aula-01
2039884359 apresentação aula-012039884359 apresentação aula-01
2039884359 apresentação aula-01
 
Teaching Algoritms Using Visual Basic (Hungarian)
Teaching Algoritms Using Visual Basic (Hungarian)Teaching Algoritms Using Visual Basic (Hungarian)
Teaching Algoritms Using Visual Basic (Hungarian)
 
Gizmo brochure
Gizmo brochureGizmo brochure
Gizmo brochure
 
Produk 2015 Terasik
Produk 2015 TerasikProduk 2015 Terasik
Produk 2015 Terasik
 
Mobile Security Blanco/Ueda
Mobile Security Blanco/UedaMobile Security Blanco/Ueda
Mobile Security Blanco/Ueda
 
Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...
Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...
Geotechnical Investigation of Soil around Arawa-Kundulum Area of Gombe Town, ...
 
Chapter 14 lesson 2
Chapter 14 lesson 2Chapter 14 lesson 2
Chapter 14 lesson 2
 
Marketing 3.0
Marketing 3.0Marketing 3.0
Marketing 3.0
 
To be , have got and can
To be , have got and canTo be , have got and can
To be , have got and can
 
Wh leaping frogs game
Wh   leaping frogs gameWh   leaping frogs game
Wh leaping frogs game
 
60314 comparatives and_superlatives solar system
60314 comparatives and_superlatives solar system60314 comparatives and_superlatives solar system
60314 comparatives and_superlatives solar system
 
Ppt ips indonesia
Ppt ips indonesiaPpt ips indonesia
Ppt ips indonesia
 
معرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژاد
معرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژادمعرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژاد
معرفی و آموزش سامانه مدیریت محتوای مزانین - سید مسعود صدر نژاد
 
Emerging Picture of Value Based Pricing
Emerging Picture of Value Based PricingEmerging Picture of Value Based Pricing
Emerging Picture of Value Based Pricing
 
Nuevas tecnologías
Nuevas tecnologíasNuevas tecnologías
Nuevas tecnologías
 
Geological Site Investigation Methods
Geological Site Investigation MethodsGeological Site Investigation Methods
Geological Site Investigation Methods
 
Detailed Lesson Plan (5A's)
Detailed Lesson Plan (5A's)Detailed Lesson Plan (5A's)
Detailed Lesson Plan (5A's)
 

Similar to Master Thesis

An Introduction To Text-To-Speech Synthesis
An Introduction To Text-To-Speech SynthesisAn Introduction To Text-To-Speech Synthesis
An Introduction To Text-To-Speech SynthesisClaudia Acosta
 
Modelling Time in Computation (Dynamic Systems)
Modelling Time in Computation (Dynamic Systems)Modelling Time in Computation (Dynamic Systems)
Modelling Time in Computation (Dynamic Systems)M Reza Rahmati
 
optimization and preparation processes.pdf
optimization and preparation processes.pdfoptimization and preparation processes.pdf
4 Morpho-syntactic Annotation Method
  4.1 Introduction
  4.2 Our Main Method
  4.3 Preliminary Step
  4.4 Word Analysis
  4.5 Choosing Results
  4.6 Result File Generation
  4.7 Conclusion
5 Realization and Performance Evaluation
  5.1 Introduction
  5.2 Tools and Resources
    5.2.1 Al-Khalil Analyzer
    5.2.2 MADA Analyzer
    5.2.3 Al-Khalil TD Version
    5.2.4 Tunisian Dialect Dictionary
  5.3 Tunisian Dialect Annotation Tool
    5.3.1 Process of TDAT
    5.3.2 Realization
    5.3.3 TDAT Interfaces
  5.4 Evaluation
    5.4.1 Gold Standard for the TD Language
    5.4.2 Evaluating the TDAT
  5.5 Conclusion
Conclusion
A TD Enriched Orthographic Transcription
  A.1 Inter-pausal Units Segmentation
  A.2 Conventions
    A.2.1 Typographic Rules
    A.2.2 Pronunciation Notation
    A.2.3 Liaisons
    A.2.4 Non-Arabic Phonemes
    A.2.5 Reported Speech
    A.2.6 Incomprehensible Sequences
    A.2.7 Laughters
    A.2.8 Pauses
B LDC Commitment
Bibliography
List of Figures

3.1 The proportion of themes in the corpus
3.2 Sections and Turns in Transcriber
3.3 Manage speakers in Transcriber
3.4 Manage speakers in Praat
4.1 Morpho-syntactic annotation steps in our method
4.2 Example of the Segmented Text
4.3 Analyzing process of a TD word
4.4 Analyzing a word with the TD dictionary
5.1 XML Schema of the Annotated Text
5.2 Main Interface in the TDAT
5.3 Transcription window in the TDAT
5.4 Segmented Text window in the TDAT
5.5 Add an analysis result window in the TDAT
5.6 Analyze window in the TDAT
5.7 Analyse Details window in the TDAT
5.8 Annotation options window in the TDAT
5.9 Analysis Result File window in the TDAT
List of Tables

3.1 Corpus files content
3.2 Description of tags used in the references XML file
3.3 Clitics in the TD language
4.1 Annotations extracted from the transcription files
4.2 Description of tags used in the segmented text file
4.3 Levenshtein distance table example
4.4 Description of tags used in the annotation result file
5.1 Description of icons used in the annotation interface
5.2 Evaluation results of the word segmenter module
5.3 Evaluation results of the TD dictionary module using the Levenshtein distance function
5.4 Evaluation results of the TDAT
List of Abbreviations

AMADAT  Arabic Multi-Dialectal Transcription Tool
AOC     Arabic Online Commentary Dataset
CES     Corpus Encoding Standard
DTD     Document Type Definitions
ECA     CALLHOME Egyptian Arabic Speech
ELAN    EUDICO Linguistic Annotator
FBIS    Foreign Broadcast Information Service
HMM     Hidden Markov Model
HPL     Hewlett-Packard Laboratories
ICA     Iraqi Colloquial Arabic
LA      Levantine Arabic
LATB    Levantine Arabic Treebank
LDC     Linguistic Data Consortium
MSA     Modern Standard Arabic
NLP     Natural Language Processing
OSAC    Open Source Arabic Corpora (updated)
OSAc    Open Source Arabic Corpus
POS     Part of Speech
POST    Part-of-Speech Tagging
SAAVB   Saudi Accented Arabic Voice Bank
SAMPA   Speech Assessment Methods Phonetic Alphabet
STT     Speech-to-Text
TD      Tunisian Dialect
TDAT    Tunisian Dialect Annotation Tool
TEI     Text Encoding Initiative
XML     Extensible Markup Language
XSL     eXtensible Stylesheet Language
Introduction

The Arabic language is spoken by about 300 million people [Al-Shamsi 2006] and is the fourth most spoken language in the world; it is therefore a major international modern language. Given the number of its speakers, however, the computing resources available for Arabic remain few.

Arabic is a blend of Modern Standard Arabic, used in written and formal spoken discourse, and a collection of related Arabic dialects. This mixture was described by Hymes [Hymes 1973] as a linguistic continuum. Indeed, Arabic dialects present significant phonological, morphological, lexical, and syntactic differences among themselves and when compared to the standard written form. Furthermore, the presence of diglossia [Ferguson 1959] is a real challenge for Arabic speech language technologies, including corpus creation to support Speech-to-Text (STT) systems. Additional difficulties arise because the Arabic dialects are morphologically complex, and because only a small amount of text data exists for spoken Arabic, which has no official written form.

A better set of corpora would support further research in this area, for example by helping linguists analyze the phenomena of Arabic dialects. It would also lay the ground for creating new and better end-user applications. One of the fundamental prerequisites of any Natural Language Processing (NLP) application for a specific language, such as the Tunisian Dialect (TD), is the existence of corpora. The construction of speech corpora for the TD is thus essential both for studying its specificities and for advancing its NLP applications, such as speech recognition. A few small corpora already exist, developed in previous research. However, these corpora mix the TD with other dialects, are not specific to the TD, and do not include any diacritic information. In general, they are either part of closed projects or not freely available, and they contain no morpho-syntactic annotation or phonetic information.

The aim of this project is to investigate how to collect and transcribe speech data, whether existing transcription tools can be used, and which guidelines are appropriate. It also asks how the transcripts should be annotated and which methods can be used to do so.

The report is divided into two parts. The first part presents the state of the art of existing speech corpus resources for the Arabic language; its second chapter lists some morpho-syntactic annotation methods. The second part describes the method and resources used to collect, transcribe, and annotate the speech data. The third chapter presents the steps we followed to collect and transcribe the speech data. In the fourth chapter, we present our method for the morpho-syntactic annotation task. The last chapter presents the tools and resources used, the developed tool, and the obtained results. Finally, a conclusion summarizes the results of our work and presents some future prospects.
Chapter 1
Linguistic Resources

1.1 Introduction

A spoken language corpus is defined as "a collection of speech recordings which is accessible in computer readable form and which comes with annotation and documentation sufficient to allow re-use of the data in-house, or by scientists in other organizations" [Gibbon 1997]. Creating speech corpora is crucial for studying the different characteristics of a spoken language, as well as for developing applications that deal with the voice, such as speech recognition systems.

In this chapter, we answer the following question: how is a speech corpus created? We first present methods of speech and text data collection. Then, we focus on some available corpora for the Arabic language. After that, we introduce the orthographic transcription task by reviewing its guidelines and tools.

1.2 Speech and Text Data Collection

A prerequisite for the successful development of spoken language resources is a good definition of the speech data to be collected. Text data collection involves three steps. The first is to specify the source of the data (books, novels, chat rooms, etc.), its type (standard written language or dialectal), theme (social, news, sport, etc.), encoding, and format (transcribed files, web pages, XML files, etc.). The second step is the collection itself, performed with different techniques such as harvesting large amounts of data from the web [Diab 2010] or transcribing speech data [Messaoudi 2004]; an automatic speech recognition method, as used in [Messaoudi 2004, Gauvain 2000], can also be applied to extract text from speech data. The third step consists in adapting and organizing these data [Diab 2010].

Speech data collection follows the same steps as text collection. In addition, speech data of different types (audio or video) and formats (mp3, wave, avi, etc.) can be collected in different ways. The easiest is to download streaming videos and audio from the Internet; unfortunately, this method cannot guarantee good data quality. Otherwise, we must resort to recording, where we can fix the subjects and the speakers' dialects as we wish and ensure the best data quality. However, recording requires funding to pay speakers and to buy specific equipment. Some annotation tools [Kipp 2011] allow direct access to broadcast data through the associated Uniform Resource Locator (URL), which removes the collection step; regrettably, this feature is currently not available in the voice annotation tools we used.

1.3 Arabic Corpora

The Arabic language is composed of a standard written language (Modern Standard Arabic) and spoken dialects. The Arabic dialects are used extensively in almost all everyday conversations and are therefore of considerable importance. However, owing to the lack of data and resources, Natural Language Processing (NLP) technology for dialectal Arabic is still in its infancy: basic resources, tokenizers, and morphological analyzers, which have been developed for Modern Standard Arabic (MSA), are still virtually non-existent for the dialects.

1.3.1 Modern Standard Arabic Corpora

Many research projects have contributed to the development of MSA corpora. The updated version of the Open Source Arabic Corpora (OSAC) described in [Saad 2010] includes the British Broadcasting Corporation (BBC) Arabic corpus collected from bbcarabic.com, the Cable News Network (CNN) Arabic corpus collected from cnnarabic.com, and the Open Source Arabic Corpus (OSAc) collected from multiple sites. The OSAC corpus contains about 23 MB of text after removing stop words.

The Foreign Broadcast Information Service (FBIS) corpus is another MSA corpus, created by [Messaoudi 2005] and used in [Vergyri 2004]. The data set comprises a collection of radio newscasts from various radio stations in the Arabic-speaking world (Cairo, Damascus, Baghdad), totalling approximately 40 hours of speech and roughly 240K words. The transcription of the FBIS corpus was done in Arabic script only and does not contain any diacritic information.
The Linguistic Data Consortium (LDC) provides the Penn Arabic Treebank [Duh 2005], a data set of newswire text from Agence France Presse, An Nahar News, and Ummah Press transcribed in standard MSA script. It contains more than 113,500 tokens, analyzed and provided with disambiguated morphological information.

1.3.2 Dialectal Corpora

At present, the major standard dialect corpora are available through the LDC under the DARPA EARS (Effective, Affordable, Reusable Speech-to-Text) program, which develops robust speech recognition technology for a range of languages and speaking styles and includes data from the Egyptian, Levantine, Gulf, and Iraqi dialects. The LDC also provides conversational and broadcast speech together with their transcripts.

Levantine Arabic (LA) is represented by the Levantine Arabic QT Training Data Set [Maamouri 2006a], a set of telephone conversations between Levantine Arabic speakers [Duh 2005]. The data set contains approximately 250 hours of telephone conversations. About 2,000 successful calls have been collected, distributed across regional dialects (Levantine, Egyptian, Gulf, Iraqi, Moroccan, Saudi, Yemeni). LA provides both the conversational speech and its transcripts.

Another Arabic colloquial corpus, CALLHOME Egyptian Arabic Speech (ECA), used in [Duh 2005, Gibbon 1983], is dedicated to the Egyptian dialect. The data set consists of 120 telephone conversations between native speakers of Egyptian Arabic. The ECA corpus contains both dialectal and MSA word forms, and it is accompanied by a lexicon giving the morphological analysis of all words in terms of stem and morphological characteristics such as person, number, gender, POS, etc.

Finally, the Saudi Arabic dialect is represented by the Saudi Accented Arabic Voice Bank (SAAVB) [Alghamdi 2008], which is very rich in terms of its speech sound content and speaker diversity within Saudi Arabia. The total recorded speech lasts 96.37 hours, distributed among 60,947 audio files. SAAVB was externally validated and used by the IBM Egypt branch to train their speech recognition engine.
The English-Iraqi corpus is another Arabic corpus, mentioned in [Precoda 2007]; it consists of 40 hours of transcribed speech audio from DARPA's Transtac program.

Many small corpora have been developed to satisfy specific needs, as described in [Al-Onaizan 1999, Ghazali 2002, Barkat-Defradas 2003], where ten speakers originally from Egypt, Syria, Lebanon and Jordan and from the Moroccan Arabic area (Algeria and Morocco) listened to the story of the North Wind and the Sun in French and translated it spontaneously into their dialects.

Other research projects are limited to the collection of dialectal text data. The Arabic Online Commentary Dataset (AOC), described in [Zaidan 2011], was created by crawling the websites of three Arabic newspapers (Al-Ghad, Al-Riyadh, Al-Youm Al-Sabe). The commentary data consist of 52.1M words and include sentences from articles in the 150K crawled web pages; 41% of the content contains dialectal words. In addition to the AOC, an Arabic-Dialect/English Parallel Text was developed by Raytheon Bolt Beranek and Newman (BBN) Technologies, the LDC, and Sakhr Software. This corpus contains approximately 3.5 million tokens from Arabic dialect sentences with their English translations. The data consist of Arabic web text filtered automatically from a large Arabic text corpus provided by the LDC.

1.4 Orthographic Transcription

The acoustic signal of audio content may correspond to speech, music or noise, but also to mixtures of speech, music and noise. In addition, a single recording may contain a variety of speakers and topics; transcribers can work on a given subject successively or simultaneously, and the sound quality (fidelity) of the recording may vary significantly over time.

The different stages of transcription work are the segmentation of the soundtrack, the identification of turns and speakers, the orthographic transcription itself, and verification. Depending on the transcriber's choices, these steps can be conducted in parallel or sequentially over long portions of the signal.

The difficulty of transcription depends on the number of speakers involved in the recordings and on the clarity of their pronunciation. Processing many files in quick succession does not make the work faster, as exhaustion slows down the process; it is preferable to take a rest between files.
In [Al-Sulaiti 2004], the average time needed by a non-professional typist to transcribe a five-minute Arabic spoken recording, without any enriched orthographic annotation, is 1:50:42 (the average of the shortest and the longest transcription times).

The annotation step aims to structure the recording, that is, to segment it and to describe the acoustic signal at the different levels deemed relevant for further processing. A transcription can reflect neither the audio recording perfectly nor every pronunciation of a given subject or term, and it can itself become the object of deeper studies of semantics, syntax, or pronunciation. Manual transcription of audio recordings such as radio or television streams advances research in automatic transcription, indexing, and archiving: the transcripts provide the linguistic resources and data that make it possible to build an automatic recognition system, which can then be used to produce automatic transcriptions.

1.4.1 Transcription Software

There are different types of tools for labelling and annotating speech corpora. Some target audio formats, such as Transcriber [Barras 2000], Praat [D.Weenink 2013], SoundIndex, and AMADAT, and others target video formats, for example Anvil [Kipp 2011] and the EUDICO Linguistic Annotator (ELAN) [Dreuw 2008].

- Transcriber is free software that has been used in many projects such as [Messaoudi 2005, Piu 2007, Fromont 2012]; it has become very popular thanks to its simplicity and efficiency, since it makes transcribing and labelling easy.
- Praat is a productivity tool for phoneticians. It allows speech analysis, synthesis, labelling and segmentation, speech manipulation, statistics, and learning algorithms; the manipulation package includes statistical tools and produces publication-quality graphics.
- SoundIndex is a tool that lets the user attach audio tags at any level of the hierarchy of an XML file by setting attribute values such as the start and the end of the audio segment in the sound editor. The interpretation of the audio tags is written in XSL. (Software documentation: http://michel.jacobson.free.fr/soundIndex/Sommaire.htm)
- The Arabic Multi-Dialectal Transcription Tool (AMADAT) supports speech transcription and provides a very helpful correction level. (User guidelines: http://projects.ldc.upenn.edu/EARS/Arabic/EARS_AMADAT.htm)
- Anvil is a free video annotation tool. It offers frame-accurate, hierarchical, multi-layered annotation driven by user-defined annotation schemes. Colour-coded elements annotated on multiple tracks are displayed time-aligned on the annotation board. Special features include cross-level links, non-temporal objects, and a project tool for managing multiple annotations. Anvil can import data from the widely used public-domain phonetic tools Praat and XWaves, and its data files are XML-based.
- ELAN is an annotation tool for creating, editing, visualizing, and searching annotations for video and audio data. This software aims to provide a sound technological basis for the annotation and exploitation of multimedia recordings. ELAN is specifically designed for the analysis of language, sign language, and gesture, yet it can be used on any media corpora, with video and/or audio data, for purposes of annotation, analysis, and documentation.

1.4.2 Transcription Guidelines

The transcription process follows specific conventions so as to produce records that are structured by thematic content, speakers, and other speech information; the information produced is called annotation. Many conventions have been defined in NLP projects to satisfy the need for a homogeneous transcription style and to provide annotation enrichments. Generally, these conventions depend on the speech data format and on the transcription tool used.

When a speech corpus is transcribed into written text, the transcriber is immediately confronted with the following question: how should the reality of oral speech be reflected in the corpus? Sets of rules for writing speech corpora are designed to provide an enriched orthographic transcription; these conventions establish which phenomena are annotated. Numerous studies have dealt with prepared speech, for example broadcast news [Cam 2008]. Conversational speech, however, is a more informal activity in which participants constantly manage the topic and in which speakers and speech turns, corresponding to changes of speaker, must be identified [Gro 2007, Cam 2008, André 2008]. As a consequence, numerous phenomena appear, such as hesitations, repeats, feedback, backchannels, etc.
Other phonetic phenomena such as non-standard elisions, reduction phenomena [Meunier 2011], truncated words and, more generally, non-standard pronunciations are also very frequent. All these phenomena can affect phonetization. Hence, identifying different types of pauses, long pauses between turn-takings and short pauses between words, is very useful for further purposes such as the development of a voice recognition system [Alotaibi 2010]. In [Gro 2007], the conventions focus on segment structures marked by elongation, truncation, aspiration, and sighs.

Spontaneous oral production is a real problem in terms of annotation because, according to [Shriberg 1994]: "Disfluencies show regularities in a variety of dimensions. These regularities can help guide and constrain models of spoken language production. In addition they can be modeled in applications to improve the automatic processing of spontaneous speech."

Another definition of disfluencies is given in [Piu 2007]: "Disfluencies (repeats, word-fragments, self-repairs, aborted constructs, etc.) inherent in any spontaneous speech production constitute a real difficulty in terms of annotation." Indeed, the annotation of these phenomena does not seem easily automatizable, because their study requires an interpretative judgement. There are different types of disfluencies, as described in [Piu 2007]:

- Repetition disfluency is among the most frequent types in conversational speech (accounting for over 20% of disfluencies). According to [Cole 2005]: "Repetition disfluencies occur when the speaker makes a premature commitment to the production of a constituent, perhaps as a strategy for holding the floor, and then hesitates while the appropriate phonetic plan is formed."
- Self-correction, as described in [Kurdi 2003], is the substitution of a word or series of words by others, to modify or correct a part of the statement.
- [Pallaud 2002] highlights word fragments (primers) as morphemes whose enunciation is interrupted.

Generally, disfluencies can combine at least two of the phenomena mentioned above simultaneously.
In [Dipper 2009]: "Transcription guidelines specify how to transcribe letters that do not have a modern equivalent. They also specify which letter forms represent variants of one and the same character, and which letters are to be transcribed as different characters."

In [Cam 2008], numerous transcription rules relate to the speech text, such as how to write letters, punctuation, numbers, Internet addresses, acronyms, spelling, abbreviations, hesitations, repetitions, truncations, and absent or unknown words. A list of markups is also used to identify noise, pronunciation problems, backchannels, and comments. In [André 2008], specific pronunciations were recorded with the SAMPA phonetic alphabet. General rules for transcribing short vowels, spelling, and diacritics are presented in the LDC Guidelines for Transcribing Levantine Arabic (available on the LDC website: http://ldc.upenn.edu/Projects/EARS/Arabic/www.Guidelines_Levantine_MSA.htm).

1.5 Conclusion

Despite all the attempts made by the LDC and other research projects to provide speech corpora for Arabic dialects, some varieties, such as the Tunisian Dialect (TD), still need more work on corpus construction. These attempts have faced several challenges, some of them related to the Arabic language and to general NLP issues. Moreover, some problems become noticeable during the speech transcription process, such as the ambiguity of a word's transcription. Another problem occurs when multiple sound sources are present: it is then necessary to focus the transcription on the most prominent source. Likewise, when two speakers talk in the foreground, both can be transcribed through the mechanism of superimposed speech.
Chapter 2
Morpho-Syntactic Annotation

2.1 Introduction

A transcript can be annotated by adding linguistic information to each word. A definition of linguistic annotation is given in [Sawaf 2010]: "Corpus annotation is the practice of adding interpretative, especially linguistic, information to a text corpus, by coding added to the electronic representation of the text itself." Grammatical tagging is the task of associating a label, or tag, with each word in the text to indicate its grammatical class. Generally speaking, the morpho-syntactic annotation process relies on word structure, as described in [Al-Taani 2009]: patterns and affixes are used to determine the grammatical class of a word according to defined rules. The process also differs according to the data and resources used; the approaches employed therefore range from supervised to unsupervised, as described in [Jurafsky 2008]. In the following sections we introduce morpho-syntactic annotation methods for the Modern Standard Arabic (MSA) language and for dialectal Arabic.

2.2 Morpho-syntactic Annotation Methods for the MSA Language

Morpho-syntactic annotation of MSA has been performed with different approaches, such as the statistical approach of [Al-Shamsi 2006] and the learning approach of [Bosch 2005]. More recently, some works have combined approaches to improve tagger performance. Some of these works are described below.

The statistical approach was used in [Al-Shamsi 2006] to handle part-of-speech tagging (POST) of Arabic text. The developed method is based on a Hidden Markov Model (HMM) and follows these steps:

1. creation of a set of tags;
2. use of Buckwalter's stemmer to stem the Arabic text of the corpus, which contains 9.15 MB of native Arabic articles;
3. manual correction of tagging errors;
4. design and construction of an HMM-based model of Arabic POS tags;
5. training of the developed tagger on the annotated corpus.

The proposed method achieved an F-measure of 97%.

Another Arabic POS tagging system was proposed by [Hadj 2009]. Its tagging is based on sentence structure and combines morphological analysis with an HMM. The morphological analysis aims to reduce the size of the tag lexicon by segmenting words into prefixes, stems, and suffixes. The HMM represents the sentence structure in order to take the logical linguistic sequencing into account: each possible state of the HMM corresponds to a tag, and the transitions between states are governed by the syntax of the sentence. The training corpus is composed of old texts extracted from books of the third century, manually tagged with the developed tagset. The system was evaluated on the same corpus and achieved a recognition rate of 96%, which is very promising given the size of the tagged data.

In addition to statistical approaches, recent applications tend to explore machine learning methods to handle Arabic morphology and POS tagging. A memory-based learning approach was developed by [Bosch 2005] for the morphological analysis and part-of-speech tagging of written Arabic. The classification task is performed with a k-nearest neighbor classifier. Memory-based learning, a supervised inductive learning algorithm, treats a set of labeled training instances as points in a multi-dimensional feature space and stores these instances as such in an instance base in memory. Furthermore, [Bosch 2005] employed a modified value difference metric as the distance function to determine the similarity between pairs of feature values; this metric uses the conditional probabilities of the two values given the classes. To train and test the approach, [Bosch 2005] used the Arabic Treebank 1 (version 2.0) corpus, which consists of 166,068 tagged words. The morphological analyzer was evaluated on predicting the part-of-speech tags of the segments, the positions of the segmentations, and all letter transformations between the surface form and the analysis; the obtained precision, recall, and F-score are 0.41, 0.43 and 0.42 respectively. The POS tagger attained an accuracy of 66.4% on unknown words and 91.5% on all words in held-out data. Combining the morpho-syntactic analyses generated by the morphological analyzer with the parts of speech predicted by the tagger yields a joint accuracy of 58.1%, i.e., the proportion of correctly predicted tags corresponding to the full analysis of unknown words. The main limitation of the memory-based learning approach, as concluded in [Bosch 2005], is its inability to recognize the stem of an unknown word and, accordingly, the appropriate vowel insertions.
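To make the memory-based idea concrete, the following minimal sketch stores each training token with a small context window and tags an unseen token by majority vote of its nearest stored neighbours. It is only an illustration of the general technique: the feature set, the overlap distance, the toy tagset and the transliterated example tokens are invented here and do not reproduce the system of [Bosch 2005].

```python
from collections import Counter

def features(tokens, i):
    """Toy context features for token i: previous word, last three
    characters of the word, and next word."""
    word = tokens[i]
    prev = tokens[i - 1] if i > 0 else "<s>"
    nxt = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return (prev, word[-3:], nxt)

class MemoryBasedTagger:
    """Store every training instance; tag new tokens by majority vote
    of the k nearest instances under a simple overlap distance."""

    def __init__(self, k=3):
        self.k = k
        self.instances = []          # list of (feature_tuple, tag)

    def train(self, tagged_sentences):
        for sentence in tagged_sentences:
            tokens = [w for w, _ in sentence]
            for i, (_, tag) in enumerate(sentence):
                self.instances.append((features(tokens, i), tag))

    def _distance(self, f1, f2):
        # Overlap distance: number of mismatching feature values.
        return sum(a != b for a, b in zip(f1, f2))

    def tag(self, tokens):
        result = []
        for i in range(len(tokens)):
            f = features(tokens, i)
            nearest = sorted(self.instances,
                             key=lambda inst: self._distance(f, inst[0]))[:self.k]
            votes = Counter(tag for _, tag in nearest)
            result.append((tokens[i], votes.most_common(1)[0][0]))
        return result

# Invented transliterated training data and test sentence.
train = [[("il-walad", "NOUN"), ("yiktib", "VERB"), ("ktaab", "NOUN")],
         [("il-bint", "NOUN"), ("tiqra", "VERB"), ("jariida", "NOUN")]]
tagger = MemoryBasedTagger(k=1)
tagger.train(train)
print(tagger.tag(["il-walad", "yiqra", "ktaab"]))
```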
Another approach, combining statistical and rule-based techniques, was introduced by [Khoja 2001] to construct an Arabic part-of-speech tagger. First, the approach relies on traditional Arabic grammatical theory to determine the rules used when stemming a word; these rules determine the stem or root by removing affixes (prefixes, suffixes and infixes). Second, the approach uses lexical and contextual probabilities: the lexical probability is the probability of a word having a certain grammatical class, whereas the contextual probability is the probability of one tag following another. Both are estimated from the tagged training corpus. The method consists in searching for a word in the lexicon to determine its possible tags; words not found in the lexicon are stemmed, using combinations of affixes to determine the tag. Finally, to disambiguate ambiguous and unknown words, [Khoja 2001] used a statistical tagger based on the Viterbi algorithm [Jelinek 1976].

To train the tagger and construct the lexicon, [Khoja 2001] used a manually tagged corpus of 50,000 Modern Standard Arabic words extracted from the Saudi Al-Jazirah newspaper; the constructed lexicon contains 9,986 words. To test the tagger, four corpora (85,159 words) were collected from newspapers and papers in social science; in addition to MSA words, the test corpus contained some colloquial words. The statistical tagger achieved an accuracy of around 90% when disambiguating ambiguous words. Furthermore, [Khoja 2001] used an Arabic dictionary (4,748 roots) to test the developed stemmer, which reached an accuracy of 97%. Since the unanalyzed words are generally foreign terms, proper nouns, or incorrectly written words, [Khoja 2001] concluded that adding a pre-processing component could solve the problem.
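The statistical core shared by the taggers of [Al-Shamsi 2006] and [Khoja 2001], an HMM decoded with the Viterbi algorithm, can be sketched in a few lines. The tagset, the hand-set probabilities and the transliterated example below are invented for illustration; they are not taken from the cited systems.

```python
import math

def viterbi(words, tags, trans, emit, start):
    """Bigram HMM POS tagging with Viterbi decoding.
    trans[t1][t2]: P(t2 | t1); emit[t][w]: P(w | t); start[t]: P(t at start).
    Unknown words and transitions get a small smoothing probability."""
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    V = [{t: (logp(start.get(t, 0)) + logp(emit[t].get(words[0], 1e-6)), None)
          for t in tags}]
    for w in words[1:]:
        column = {}
        for t in tags:
            best_prev, best_score = None, float("-inf")
            for p in tags:
                score = V[-1][p][0] + logp(trans[p].get(t, 1e-6))
                if score > best_score:
                    best_prev, best_score = p, score
            column[t] = (best_score + logp(emit[t].get(w, 1e-6)), best_prev)
        V.append(column)

    # Backtrack from the best final state.
    last = max(tags, key=lambda t: V[-1][t][0])
    path = [last]
    for column in reversed(V[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))

# Toy model over transliterated tokens (all probabilities are invented).
tags = ["NOUN", "VERB", "PART"]
start = {"NOUN": 0.5, "VERB": 0.3, "PART": 0.2}
trans = {"NOUN": {"VERB": 0.5, "NOUN": 0.3, "PART": 0.2},
         "VERB": {"NOUN": 0.7, "PART": 0.2, "VERB": 0.1},
         "PART": {"VERB": 0.6, "NOUN": 0.4}}
emit = {"NOUN": {"walad": 0.4, "ktaab": 0.4},
        "VERB": {"yiktib": 0.8},
        "PART": {"maa": 0.9}}
print(viterbi(["walad", "yiktib", "ktaab"], tags, trans, emit, start))
# expected: ['NOUN', 'VERB', 'NOUN']
```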
2.3 Morpho-syntactic Annotation Methods for Dialectal Arabic

[Maamouri 2006b] describes a supervised approach used to annotate dialectal Arabic data. A word list from the Levantine Arabic Treebank (LATB) data was used to manually annotate the most frequent surface forms; pattern-matching operations were then applied to identify potential new prefix-stem-suffix combinations among the remaining unannotated words in the list. The Morphological/Part-of-Speech/Gloss (MPG) tagging included morphological analysis, POS tagging, and glossing. The developed system was evaluated before and after the use of the dictionary; the evaluation shows a reduction of more than 10% in annotation errors.

Another supervised approach, in [Duh 2005], was designed for tagging dialectal Arabic using ECA data. The system is based on a statistical trigram tagger in the form of an HMM and a baseline POS tagger. Statistical modeling and cross-dialectal data-sharing techniques were used to enhance the performance of the baseline tagger. The adopted approach requires only raw text data from several varieties of Arabic and a morphological analyzer for MSA; no dialect-specific tools are used. To evaluate the system, [Duh 2005] compared the obtained results with those obtained when using:

- a supervised tagger trained on hand-annotated data;
- a state-of-the-art MSA tagger applied to Egyptian Arabic.

As a result, the ECA tagger shows a 10% improvement.

In addition to the supervised approaches, other projects tend toward unsupervised approaches, for example [Chiang 2006]. An Arabic dialect parser is described in [Chiang 2006], where three frameworks were constructed for leveraging MSA corpora in order to parse Levantine Arabic (LA). The process is based on knowledge about the lexical, morphological, and syntactic differences between MSA and LA. [Chiang 2006] evaluated three methods:

- sentence transduction, in which the LA sentence to be parsed is turned into an MSA sentence and then parsed with an MSA parser;
- treebank transduction, in which the MSA treebank is turned into an LA treebank;
- grammar transduction, in which an MSA grammar is turned into an LA grammar which is then used for parsing LA.

The MSA treebank data used comprise 17,617 sentences and 588,244 tokens and are combined with four different lexicons: a small lexicon with uniform probabilities, a small lexicon with EM-based probabilities, a big lexicon with uniform probabilities, and a big lexicon with EM-based probabilities. To evaluate the developed parser, [Chiang 2006] used data comprising 10% of the MSA treebanks plus 2,051 sentences and 10,644 tokens from the Levantine treebank (LATB). The major limitation of this method, as concluded in [Chiang 2006], is the absence of a demonstration of cost-effectiveness.
Another approach, described in [Al-Sabbagh 2012], uses a function-based annotation scheme in which words are annotated according to their grammatical functions; in this scheme, morpho-syntactic structures and grammatical functions may differ from each other. The method is based on an implementation of Brill's transformation-based POS tagging algorithm. The tagger was trained on a manually annotated Twitter-based Egyptian Arabic corpus composed of 22,834 tweets and 423,691 tokens, and it was evaluated with ten-fold cross-validation. The obtained F-measures are 87.6% for POS tagging without semantic feature labeling and 82.3% for POS tagging with tokenization and semantic features. A problem faced during analysis concerns three-letter and two-letter words, which are highly ambiguous and can have multiple readings depending on the short vowel pattern; for example, a word written with the bare consonants j-d can be analyzed by the tagger as a noun (meaning grandfather or seriousness) or as an adverb (meaning seriously). To solve this problem, [Al-Sabbagh 2012] concluded that a word sense disambiguation module is essential to improve performance on highly ambiguous words.

2.4 Conclusion

In this chapter, we introduced morpho-syntactic annotation methods for MSA and dialectal Arabic. The choice of approach for the morpho-syntactic annotation task, such as a statistical or a learning approach, depends on the available data and resources. Unsupervised techniques are not suitable for poorly resourced languages such as the dialects. Therefore, the POS tagging process for colloquial Arabic still needs more improvement in terms of corpus collection and annotation.
Chapter 3
Data Collection and Transcription

3.1 Introduction

The transcription process consists of two basic steps. The first provides the voice data that will later be transcribed. The second consists in transcribing the voice data according to the directives we have established; to allow a better representation of spontaneous speech phenomena, these directives take the specificities of TD transcription into consideration. The two steps are detailed in the following sections.

3.2 Speech Collection

The aim of this section is to describe the collection of speech data, the first step in corpus creation. The choice of speech data content and type is very important and conditions further uses of the corpus. We chose to provide both audio and video speech so as to support new research directions, especially video annotation [Kipp 2011]. Furthermore, including different Tunisian dialects (Sfaxian dialect, Sahel dialect, etc.) improves the representativeness of the TD in our corpus.

As the main source of speech data, we used broadcast conversational speech, as in the projects [Lamel 2007] and [Belgacem 2010]. These streams are generally radio and television talk shows, debates, and interactive programmes in which the general public is invited to participate in the discussion by telephone. In general, the common conversational dialect in Tunisia is the dialect of the capital, which is also used on national TV and radio stations and by the majority of educated people; consequently, we allocated the largest part of our corpus to this dialect.
Providing speech data with a variety of themes increases the size of the corpus vocabulary and is very useful for further applications such as theme classification [Bischo 2009]. We defined the following list of themes for our data selection: Religious, Political, Cooking, Health, and Social; the latter may include recordings that touch on more than one theme. We also defined an Other tag to cover the remaining types. Figure 3.1 shows the proportion of each theme in our corpus.

Figure 3.1: The proportion of themes in the corpus

Having a good amount of spoken recordings is fundamental to the design of the corpus. High sound quality is also required, and will be useful for future processing such as voice recognition. In addition, we included both single-speaker and multi-speaker recordings in order to capture different aspects of conversational speech. Table 3.1 describes each transcription file in our corpus; the transcribed files reach a total duration of 1 hour, 25 minutes and 37 seconds.

The collected files generally have a long duration, exceeding fifteen minutes. To simplify the transcription task, we split these recordings into sequences of between five and fifteen minutes, and then convert them to the MP3 format to match the input expected by the transcription software.
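The thesis does not name the software used to split and convert the recordings; the snippet below is one possible way to do it with the pydub library (an assumption, not the tool used in the project), cutting a long recording into chunks of at most fifteen minutes and exporting each chunk as MP3. The file names are hypothetical.

```python
from pydub import AudioSegment  # requires ffmpeg to be installed

def split_to_mp3(path, out_prefix, chunk_minutes=15):
    """Split a long recording into chunks of at most `chunk_minutes`
    and export each chunk as MP3 for the transcription software."""
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000
    for n, start in enumerate(range(0, len(audio), chunk_ms), start=1):
        chunk = audio[start:start + chunk_ms]
        chunk.export(f"{out_prefix}_{n:02d}.mp3", format="mp3")

# Hypothetical file name for one of the collected recordings.
split_to_mp3("emission_07.wav", "emission_07_part")
```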
Table 3.1: Corpus files content

File  Duration (sec)  Number of speakers  Size (MB)  Type
01    758             1                   12.1       Video
02    867             3                   13.9       Video
03    216             12                  19.1       Audio
04    232             13                  20.5       Audio
05    204             9                   18.0       Audio
06    360             8                   31.8       Audio
07    900             2                   14.4       Video
08    364             3                   5.8        Video
09    477             2                   7.6        Video
10    759             2                   12.2       Video

To document the speech data collection in more detail, we provide a references.xml file containing a description of each file in our corpus. XML is used to represent the data structures in this file; such a representation allows a simple preview for users and an easier integration into future annotation systems. In addition, we wrote a DTD file, named references.dtd, to validate the XML file. Table 3.2 describes the tags used in references.xml.

Table 3.2: Description of tags used in the references XML file

Label    Type     Name                       Unit      Description
ID       Integer  Identifier                           Identifier of the record
NAM      String   Name                                 Name of the record
DUR      Integer  Duration                   sec       Duration of the record
TOP      List     Topic                                Topic of the record
NBMASP   Integer  Number of male speakers              Number of male speakers in the record
NBFESP   Integer  Number of female speakers            Number of female speakers in the record
FISI     Float    File Size                  megabyte  Size of the record file
SOFITY   List     Source File Type                     Source file type of the record (TV or Radio)
SONAM    String   Source Name                          Source name of the record
SOFIEX   String   Source File Extension                Source file extension of the record
SOFISI   Float    Source File Size           megabyte  Source file size of the record
SOTY     List     Source Type                          Source type of the record (Audio or Video)
SODA     Date     Source Date                          Source date of the record
SOLI     String   Source Link                          Source link of the record
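The exact nesting of the elements in references.xml is not specified in the text, so the sketch below only illustrates how one corpus file could be described with the tag names of Table 3.2. The <references> and <record> wrappers and all the example values are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

def add_record(root, values):
    """Append one <record> element using the tag names of Table 3.2."""
    record = ET.SubElement(root, "record")
    for tag in ("ID", "NAM", "DUR", "TOP", "NBMASP", "NBFESP", "FISI",
                "SOFITY", "SONAM", "SOFIEX", "SOFISI", "SOTY", "SODA", "SOLI"):
        ET.SubElement(record, tag).text = str(values[tag])

root = ET.Element("references")
add_record(root, {
    "ID": 1, "NAM": "file_01", "DUR": 758, "TOP": "Social",
    "NBMASP": 1, "NBFESP": 0, "FISI": 12.1,
    "SOFITY": "TV", "SONAM": "example_channel", "SOFIEX": "mp4",
    "SOFISI": 95.0, "SOTY": "Video", "SODA": "2013-05-12",
    "SOLI": "http://example.org/stream",
})
ET.ElementTree(root).write("references.xml", encoding="utf-8",
                           xml_declaration=True)
```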
3.3 Transcription Process

The speech annotation process includes the segmentation of the soundtrack, the identification of turns and speakers, and the orthographic transcription. We applied these steps in parallel, taking into consideration the notes in the Orthographic Transcription of Tunisian Arabic [Zribi 2013] and in the Directive of Transcription and Annotation of Tunisian Dialects.

The first voice file (duration 12:38) was initially transcribed with the Speech Assessment Methods Phonetic Alphabet (SAMPA) for Arabic. We thought that using SAMPA for Arabic would give a better representation of the phonetics; however, during transcription we found it preferable to use Arabic script with diacritics, and we adopted Arabic script for the rest of the transcription process.

Transcribing Arabic spoken recordings is a very long task, especially when using Arabic script and diacritics. For example, a recording lasting 3 minutes 52 seconds took more than 4 hours. In practice, transcribing each minute of speech takes at least one hour: 15-20 minutes for the identification of turns and speakers and 40-45 minutes for the transcription itself, following the rules of our directives.
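Extrapolating from these rates gives a rough idea of the manual effort behind the corpus; the figure below is not reported in the thesis, it simply applies the stated rate of about one hour of work per minute of speech to the total corpus duration of 1 h 25 min 37 s.

```python
# Rough effort estimate derived from the rates stated above (not a figure from the thesis).
corpus_seconds = 1 * 3600 + 25 * 60 + 37       # total corpus: 1 h 25 min 37 s
minutes_of_speech = corpus_seconds / 60
hours_per_minute = 1.0                         # at least one hour of work per minute of speech
print(f"Estimated transcription effort: about {minutes_of_speech * hours_per_minute:.0f} hours")
# -> about 86 hours
```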
3.3.1 Transcription Tools

There is a variety of transcription tools for voice data (SoundIndex, AMADAT, XWaves, etc.). We selected Transcriber [Barras 2000] and Praat [D.Weenink 2013] for the transcription task. Transcriber was adopted for the following advantages:

Simple user interface:
- supports many languages (including English, French and Arabic);
- easy manipulation of the voice intervals;
- keyboard shortcuts for annotation.

Rich annotation:
- predefined annotation events (noise list, lexical list, named-entities list, etc.);
- possibility to edit or add further annotations.

Input file flexibility:
- accepts speech files of long duration;
- supports various file formats (au, wav, snd, mp3, ogg, etc.).

Better output representation:
- supports several encodings (UTF-8, ISO-8859-6, etc.);
- the output file follows a description schema.

Concerning Praat, the choice was related to the needs of our ANLP research group; in addition, Praat gives a better representation of speech overlaps and allows speech analysis.

3.3.2 Transcribing Guidelines

Transcription guidelines are meant to be followed during the annotation of our speech data. We elaborated the Orthographic Transcription of Tunisian Arabic directives [Zribi 2013], adapted from the Enriched Orthographic Transcription (TOE in French) [Bigi 2012]; to deal with the TD, some rules were modified or removed. Since a standard orthographic transcription does not take the observed phenomena of speech (elisions, disfluencies, liaisons, noise, etc.) into consideration, we enriched our directives with them. The Directive of Transcription and Annotation of Tunisian Dialects was written to provide additional phonemic and phonetic annotation of the speech data; examples were included to show how these rules are applied in the Transcriber software. This directive was adapted from the ESTER2 convention [Cam 2008] and takes the specificities of the Arabic language and of the TD into consideration. Some of the conventions are described below.

a) Identification of turns and speakers

- Sections: we define two kinds of sections in the audio document: relevant sections, identified by the title "report", and non-transcribed sections, identified by the title "nontrans". Non-transcribed sections last more than fifteen seconds and contain, for example:
  - advertising, weather reports, programme jingles,
  - applause,
  - music and songs,
  - the beginning or end of a show different from the current programme,
  - silence.
  The other sections are relevant, and they are the only ones that we segment and transcribe, as illustrated in Figure 3.2. The concept of sections is absent in Praat, so these rules are not applied when using it.

- Turn-taking: first, we identify each speaker involved in the audio document. There are two types of speakers:
Figure 3.2: Sections and Turns in Transcriber

  - global speakers, i.e., speakers common to several audio documents (presenter, journalists, etc.), identified by the syntax "First Last name";
  - local speakers, i.e., unknown speakers who intervene by telephone for example, identified by "First Last name" when possible, and otherwise by the syntax "speaker#n", where n is a number from 1 to n reflecting the order of the speakers.

  The same speaker must always appear under the same identifier, and the list of speakers must contain only speakers actually involved in the audio document. We also fill in all the information relative to these speakers, such as gender and dialect. Second, we attribute the name of the speaker to the speech turn; speech turns that contain no speaker speech are identified by the syntax "no speaker". The example in Figure 3.3 illustrates how speakers are managed in the Transcriber software.

  To solve the problem of overlapping speech, we adapted the solution mentioned in [Barras 2000]: we create a new speaker named with the syntax "First Last name speaker 1 + First Last name speaker 2". Praat gives a better representation of overlapping speech: the script of each speaker is presented separately in an individual interval, as shown in Figure 3.4.

Figure 3.3: Manage speakers in Transcriber

- Silence: silence can occur at the beginning of a speaker turn, mixed with the transcript, or at its end. We isolate silence and noise longer than 0.5 seconds in a "no speaker" speech turn. Silence longer than 0.2 seconds at the beginning of a speaker turn is isolated in a "no speaker" segment or integrated directly into the previous speech turn, and silence longer than 0.2 seconds at the end of a speaker segment is isolated in a "no speaker" segment or integrated into the upcoming "no speaker" segment. Finally, we add the hash symbol # when a silence of between 0.1 and 0.2 seconds occurs inside a relevant turn.

- Segments: a relevant segment contains an intervention of one speaker and must have a minimum of syntactic and semantic consistency. If a segment of a speech turn exceeds fifteen seconds, we redistribute it into several relevant segments.
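The silence rules above can be summarised as a small decision function. The thresholds come directly from the directives; the function name, the interval representation and the returned labels are ours and are only illustrative.

```python
def silence_action(duration, position):
    """Apply the silence rules of the directives to one silent interval.
    `position` is "start", "end", or "inside" (relative to a speaker turn).
    Only the thresholds come from the guidelines above."""
    if duration > 0.5:
        return "isolate in a 'no speaker' turn"
    if position in ("start", "end") and duration > 0.2:
        return "isolate in a 'no speaker' segment or merge with the adjacent turn"
    if position == "inside" and 0.1 <= duration <= 0.2:
        return "mark with # in the transcript"
    return "ignore"

# A few hypothetical silent intervals (duration in seconds, position in the turn).
for dur, pos in [(0.8, "inside"), (0.3, "start"), (0.15, "inside"), (0.05, "end")]:
    print(dur, pos, "->", silence_action(dur, pos))
```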
Figure 3.4: Manage speakers in Praat

b) Orthographic transcription

To transcribe the TD, we use the Modern Standard Arabic transcription rules that do not affect the characteristics of the dialect, and we define an additional set of rules, based on its phonology, for writing Tunisian Arabic words.

- Transcription of the Hamza: the Hamza is transcribed only if it is pronounced, using only one of its standard letter forms. If the absence of the Hamza in a dialectal word causes ambiguity, it should be transcribed.

- Transcription of ta marbuta: the ta marbuta should be written at the end of the word whether it is pronounced /a/ or /t/, as in the examples glossed "an apple" and "the child's book".

- Code switching: MSA, the TD and foreign languages coexist in the daily speech of Tunisian people. The transcription of MSA words should respect the transcription conventions of the Arabic language. Foreign words and MSA words are written respectively in the forms [lan:X, text or word, SAMPA pronunciation] and [lan:MSA, text or word], using the SAMPA of the corresponding language to note the speaker's pronunciation (a small parsing sketch for these tags is given after these conventions). Examples:
  - French word: [lan:Fr, informatique, ?anfurmati:ku]
  - English word: [lan:En, network, na:twirk]

- Atypical agreements: we keep the standard orthographic transcription of the words as they were said, as in the example glossed "the holidays are wonderful".

- Personal pronouns: pronouns must be transcribed using the fixed forms listed in the directives for "I", "you" (singular and plural), "we", "he", "she" and "they".

- Names of months and days: these must be transcribed as in the MSA language.

- Affixes and clitics: Table 3.3 lists the dialectal clitics. Note that the definite article should be written in full even when it is pronounced /l/; this rule also applies to words that start with a sun or moon letter.
  • 46. 3.3. Transcription Process 31 Table 3.3: Clitics in the TD language Clitic Enclitic Pronominal enclitic ¼, è, ð, eë, Ñë, Õ», e © u, ø Negation enclitic € Interrogation enclitic ú æ … Proclitic ð, È, r F , ¼, ¨, È d, Ð We used these representations for annotating t % … , PERSON NAME , ½Ó , Place name. Examples: t % … , ú q F m F Ì 9d ˆ eÔ « - Characters: The following phonemes /v/, /g/ and /p/ does'nt exist in the Arabic language. To transcribe them we add ' after these letters. - Incorrect word: When the speaker replaces a letter with an incorrect one, we keep the original letter and we add to it the corresponding correct one. We have to represent these updates, as the following example: x{Correct letter, Original letter}x, x is a letter of the word. c) Rules of marking Transcribe what is heard (hesitation, repetition, onomatopoeia, etc). The transcript should be close to the signal. - Noise: We insert the tag [i] to indicate breathing-inspiration or [e] to indicate breathing- expiration of the speaker. Insert tag [b] to indicate a noise: • Mouth noises (cough, throat noise, laughters, kisses, whisper, etc), • Rustling of papers, • Microphone noise. Insert tag [musique] to indicate music.
- Punctuation: Punctuate the text using only these punctuation marks: ., !, ?
3.4 Conclusion
A mixture of TV and radio station programs was collected and adapted through the speech data collection process described in this chapter. As a result, 10 files totalling more than 1 hour and 25 minutes were transcribed following our transcription guidelines. The transcription process was the most laborious stage of the project. During this process, we faced some problems with the transcription tools, such as the slowness of the Praat interface when transcribing long audio recordings and the improper updating of the Arabic script transcription in the Transcriber interface.
Chapter 4
Morpho-syntactic Annotation Method
4.1 Introduction
Speech and text resources for the TD are very rare, which is an obstacle to developing applications in this field. In this context, our project contributes to providing such resources by constructing a morpho-syntactically annotated speech corpus for the TD. In this chapter, we focus on the different phases of the annotation task, whose aim is to identify the grammatical class of each word. Our method consists in integrating different tools and resources to annotate TD words. We chose two morphological analyzers (MADA and Al-Khalil) to analyze MSA words, and we also use the Al-Khalil TD version and a TD dictionary to analyze TD words. We first present a global view of our method and the different steps we followed to achieve the morpho-syntactic annotation task. Then, we describe each step of this process in detail.
4.2 Our Main Method
In our method, the morpho-syntactic annotation process follows the steps described in Figure 4.1. The process starts by extracting the speakers' text and some useful word annotations (word language, named entities, etc.) from the transcription file. This information is then saved in another file with a specific structure to be used in the next, analysis step. Next, we use two morphological analyzers (MADA and the Al-Khalil TD version) and a dictionary, selected according to the characteristics of each word (language, onomatopoeia, etc.), and apply rules that we have established to decide the most suitable grammatical class for each word. Finally, after confirmation of the analysis by the user, we save the result of the morpho-syntactic annotation in an XML file with a specific structure. As a result, each word is assigned a tag that indicates its grammatical class. More details about these steps are given in the next sections.
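To make this routing concrete, the sketch below shows how a word could be dispatched either to a fixed tag or to the analyzer chain matching its language. The class and method names (WordRouter, WordInfo, annotate, ...) are our own assumptions for illustration, not the actual TDAT code.

```java
// Illustrative sketch of the routing step of Section 4.2; names are assumptions.
public class WordRouter {

    enum Lang { TD, MSA, FR, EN }

    static class WordInfo {
        String surface; Lang lang; boolean onomatopoeia; boolean namedEntity;
        WordInfo(String surface, Lang lang, boolean ono, boolean ne) {
            this.surface = surface; this.lang = lang;
            this.onomatopoeia = ono; this.namedEntity = ne;
        }
    }

    // Words with special characteristics get a fixed tag; the others are sent
    // to the analyzer chain that matches their language (Sections 4.4 a and b).
    String annotate(WordInfo w) {
        if (w.onomatopoeia) return "Onomatopoeia";
        if (w.namedEntity)  return "Named entity";
        if (w.lang == Lang.FR || w.lang == Lang.EN) return "Not-Recognized";
        return (w.lang == Lang.TD) ? analyzeTd(w.surface) : analyzeMsa(w.surface);
    }

    String analyzeTd(String word)  { return "unknown"; } // dictionary -> Al-Khalil TD -> MADA
    String analyzeMsa(String word) { return "unknown"; } // MADA -> Al-Khalil
}
```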
  • 49. 34 Chapter 4. Morpho-syntactic Annotation Method Transcription File Preliminary Step Segmented File Word Analysis Choosing Result Result File Generation Annotated File Tunisian Dialect Dictionary MADA Analyzer Al-Khalil TD version analyzer Figure 4.1: Morpho-syntactic annotation steps in our method 4.3 Preliminary Step The purpose of the preliminary step is to import speaker information and their speech text with all useful annotation (named entities, word language, ..). Indeed, these annotations vary depending on the used transcription tool (Transcriber or Praat). In the Table 4.1, we list the information extracted from annotation in the transcrip- tions les. The process of information extraction is summarized in the following steps. 1. Collect speech text: Speech text for each speaker is divided into many Speech Turns, so we gather them in a unique text for each speaker. 2. Speech text cleaning up: Speech text includes many annotations; some of anno- tations are very useful in the morpho-syntactic annotation process. Some other annotations such as noise and music are removed because they are not useful in this task.
  • 50. 4.3. Preliminary Step 35 3. Split speech text to sentences: Using punctuation annotation (` !',`.',` ?') we divide speech text into a list of sentences. 4. Extract words annotations: Words annotations (Pronunciation notation, Disuen- cies, Named entities, Word language) described in the Table 4.1 are extracted from the transcription le then will be used in the morpho-syntactic analysis process. 5. Generate Segmented Text le: After extracting useful annotation from the tran- scription le, we generate a structured le to be used in the next step. In Table 4.2 we present a description of each tag we have used in the segmentation text le. Table 4.2: Description of tags used in the segmented text le Tag Name Description Text Text The text of a transcribed le sp Speaker Speech The text of each speaker in a transcribed le s Sentence Sentence from a speaker's speech w Word A word from a sentence ponct Punctuation The punctuation of a given sentence could be one of these three values: . or ! or ?. Example: s id=sID1 ponct=. id id In the Text element, the id attribute identify the le an have as a value the name of the le. For the other elements, id identify each ele- ment by following a specic codication: Element tag+ID+Number Example: w id=wID10 ... s id=sID9 ... Continued on next page
  • 51. 36 Chapter 4. Morpho-syntactic Annotation Method Table 4.2 continued from previous page Tag Name Description na Named entity The attribute na is used to identify words which are the name of person or place. Example: w id=wID1613 na=B_N Value  0 ÖÞ … /Value /w w id=wID1614 na=I_N Value ú q F m F Ì 9d /Value /w elis Elision The attribute elis is used to identify a word which contains elision. In fact, the attribute elis con- tains the word with an elision between parentheses and the Value element contains the correct word. Example: w id=wID30 elis= © ‰ © g( d) © რValue © ‰ © g e © tƒ/Value /w wLang Word Language The default language of words in the transcribed text is the TD. All other languages are considered as foreign. Indeed, we take in consideration three possible values for foreign language: Fr: French language En: English language MSA: Modern Standard Arabic language wTrans Word Transliter- ation The transliteration of a foreign word writing in SAMPA pronunciation. Example: w id=wID1612 wLang=fr wTrans= Valueprofesseur/Value /w Continued on next page
  • 52. 4.3. Preliminary Step 37 Table 4.2 continued from previous page Tag Name Description hesi Hesitation The attribute hesi identies an hesitation word. Example: w id=wID1980 hesi= d / onom Onomato- poeia The onom attribute identies an Onomatopoeia word. Example: w id=wID41 onom= d / The Figure 4.2 present an example of a segmented le. Figure 4.2: Example of the Segmented Text 1 ?xml version='1.0' encoding='UTF−8'? 2 Text id=012 3 sp id=1 4 s id=1 ponct=. 5 w id=1 wLang=Fr wTrans=madame 6 Valuemadame/Value 7 /w 8 w id=2 na=S_N 9 Value éÒ£ e © ¯/Value 10 /w 11 w id=3 12 Valueú © ¯/Value 13 /w 14 w id=4 15 ValueÉ’ © ¯/Value 16 /w 17 w id=5 hesi=È d / 18 w id=6 19 Value © ­t ’Ë d/Value 20 /w 21 !−− the remainder of words −− 22 /s 23 !−− the remainder of sentences −− 24 /sp 25 !−− the remainder of speakers −− 26 /Text
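Since the segmented text is plain XML, it can also be consumed programmatically. The following sketch is our own illustration rather than the thesis code: it assumes the JDOM 2 library (mentioned later in the Realization section) and a hypothetical file name, and it follows the element and attribute names of the example in Figure 4.2.

```java
import java.io.File;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.filter.Filters;
import org.jdom2.input.SAXBuilder;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;

// Minimal sketch: read a segmented text file and print each word with its
// language attribute. The file name is hypothetical; the element/attribute
// names (Text/sp/s/w, wLang, Value) follow Figure 4.2.
public class SegmentedFileReader {
    public static void main(String[] args) throws Exception {
        Document doc = new SAXBuilder().build(new File("segmented_012.xml"));

        // Select every word element, whatever speaker or sentence it belongs to.
        XPathExpression<Element> words =
                XPathFactory.instance().compile("//w", Filters.element());

        for (Element w : words.evaluate(doc)) {
            String lang = w.getAttributeValue("wLang", "TD"); // TD is the default language
            Element value = w.getChild("Value");              // hesitation/onomatopoeia words may have no Value
            String surface = (value != null) ? value.getText() : "";
            System.out.println(w.getAttributeValue("id") + "\t" + lang + "\t" + surface);
        }
    }
}
```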
4.4 Word analysis
First, the characteristics of each word are used to assign it the corresponding tag. Words marked as onomatopoeia, named entities, or foreign words (French or English) are assigned respectively the tags Onomatopoeia, Named entity, and Not-Recognized. Then, we use two analyzers (MADA and the Al-Khalil TD version) and a dictionary, according to the rules defined below, to identify the grammatical class of each remaining word. Two processing paths arise depending on the language of each word: Tunisian dialect words and modern standard Arabic words. We detail these two cases in the following.
a) Tunisian Dialect Words
Figure 4.3: Analyzing process of a TD word
Analyzing a TD word follows the steps described in Figure 4.3. We start by analyzing the word with the TD dictionary, as described in Figure 4.4. If this step does not produce any analysis, we analyze the word with
the Al-Khalil TD version analyzer. If the word is recognized by neither the TD dictionary nor the Al-Khalil analyzer, we remove its diacritics and reanalyze it with the TD dictionary. Analyzing a word without diacritics allows us to handle differences in the way the diacritics of the same word are written. For example, the word © Ẃ) could be written © Ẃ) or © Ẃ) depending on the dialect of the speaker. If this still does not lead to any analysis, we reanalyze the undiacritized word with the Al-Khalil TD version analyzer. Finally, if there is still no possible analysis, we analyze the word with the MADA analyzer. Indeed, many words written without diacritics have the same form in MSA and TD, so analyzing them with an MSA analyzer may succeed in producing an analysis. During this process, if there is more than one analysis, we proceed to rank them. When no analysis at all is possible for a given word, we assign it the tag unknown. Furthermore, our system allows the user to intervene by choosing the correct analysis, by updating an analysis, or by adding a new one. Figure 4.4 gives more details about this procedure; a code sketch of the cascade is also given below.
Analyzing with the TD Dictionary: Analyzing a word with the TD dictionary is handled by applying a morpheme segmentation method similar to the one used in [Yang 2007]. Figure 4.4 shows the steps taken to analyze a word with the TD dictionary. These steps are executed sequentially until an analysis is obtained. First, we search the TD dictionaries in this order: Conjunctions, Pronouns, Number words, Interjections, Particles, Adjectives, Adverbs, Nouns, Verbs. Second, we search sequentially in the Adverb, Noun and Verb dictionaries while trying all possible TD prefixes. Third, we do the same with suffixes. Finally, we repeat the procedure with prefixes and suffixes combined.
Analysis Ranking: We rank and confirm the analyses of each word according to the following order of preference: TD dictionary, Al-Khalil TD version analyzer, MADA analyzer. Furthermore, if these tools and resources give the same analysis, we keep only one of them.
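A minimal sketch of this fallback cascade is given below. TdDictionary, Analyzer and Analysis are hypothetical interfaces introduced only to make the control flow explicit; they are not the real TDAT classes.

```java
import java.util.List;

// Sketch of the TD word analysis cascade of Section 4.4 a).
interface Analysis {}
interface TdDictionary { List<Analysis> lookup(String word); }
interface Analyzer     { List<Analysis> analyze(String word); }

class TdWordAnalyzer {
    private final TdDictionary dict;
    private final Analyzer alKhalilTd;
    private final Analyzer mada;

    TdWordAnalyzer(TdDictionary dict, Analyzer alKhalilTd, Analyzer mada) {
        this.dict = dict; this.alKhalilTd = alKhalilTd; this.mada = mada;
    }

    List<Analysis> analyze(String word) {
        List<Analysis> result = dict.lookup(word);                 // 1. TD dictionary
        if (result.isEmpty()) result = alKhalilTd.analyze(word);   // 2. Al-Khalil TD version
        if (result.isEmpty()) {
            String bare = stripDiacritics(word);                   // 3. retry without diacritics
            result = dict.lookup(bare);
            if (result.isEmpty()) result = alKhalilTd.analyze(bare);
            if (result.isEmpty()) result = mada.analyze(bare);     // 4. last resort: MSA analyzer
        }
        return result;                                             // empty list => tag "unknown"
    }

    // Remove the Arabic short-vowel diacritics (tanween through sukun, U+064B-U+0652).
    static String stripDiacritics(String word) {
        return word.replaceAll("[\\u064B-\\u0652]", "");
    }
}
```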
Figure 4.4: Analyzing a word with the TD dictionary
b) Modern Standard Arabic Words
We use two analyzers, MADA and Al-Khalil, to analyze MSA words. First we use MADA. Then, if there is no possible analysis, we analyze the word with the Al-Khalil analyzer. Finally, if there is still no possible analysis, we remove the diacritics and reanalyze the word with Al-Khalil.
4.5 Choosing results
When using the adapted Al-Khalil analyzer and the dictionary, several analyses may be returned for the same word. This problem is caused by differences in the way diacritics are written, or by ambiguity.
Problem with the Al-Khalil analysis: The adapted Al-Khalil analyzer usually returns a list of analyses for a given word, each with different information (prefix, suffix, gender, number, person, voice, etc.). In general, this problem is related to the ambiguity of the Arabic language.
Problem when using the dictionary: The problem with the dictionary analysis is related to the way diacritics are written. To solve it, we rank the returned results by comparing their distance to the original word. We use the Levenshtein distance [Haldar 2011] to measure the difference between two sequences or words. Mathematically, the Levenshtein distance between two strings a and b is given by lev_{a,b}(|a|, |b|), where:
lev_{a,b}(i, j) =
  0, if i = j = 0
  i, if j = 0 and i > 0
  j, if i = 0 and j > 0
  min( lev_{a,b}(i−1, j) + 1, lev_{a,b}(i, j−1) + 1, lev_{a,b}(i−1, j−1) + [a_i ≠ b_j] ), otherwise
• lev_{a,b}(i−1, j) + 1: the minimum corresponds to a deletion (from a to b).
• lev_{a,b}(i, j−1) + 1: the minimum corresponds to an insertion (from a to b).
• lev_{a,b}(i−1, j−1) + [a_i ≠ b_j]: the minimum corresponds to a match or a mismatch, depending on whether the respective symbols are the same.
Example: Original word: e © t( ‚Ó — word returned by the analysis: e © t( ‚ Ó ⇒ the distance between the two words is 2, as calculated in Table 4.3.
When we obtain analyses from both the Al-Khalil TD version analyzer and the TD dictionary (the case where diacritics are written differently), we use the dictionary analysis to confirm the Al-Khalil analysis if they have the same grammatical function. Finally, we generate the annotation file as described in the following section.
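The definition above translates directly into the usual dynamic-programming implementation. The sketch below is our own illustration rather than the TDAT source; the second method shows how dictionary candidates could be ranked by their distance to the original word, as described above.

```java
import java.util.Comparator;
import java.util.List;

public class Levenshtein {

    // Standard dynamic-programming Levenshtein distance, following the
    // piecewise definition given above.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // deletions only
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // insertions only
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1; // [a_i != b_j]
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,     // deletion
                                            d[i][j - 1] + 1),    // insertion
                                   d[i - 1][j - 1] + cost);      // match or substitution
            }
        }
        return d[a.length()][b.length()];
    }

    // Rank candidate surface forms so that the one closest to the original
    // word comes first (illustrative helper, not the TDAT code).
    public static void rankByDistance(String original, List<String> candidates) {
        candidates.sort(Comparator.comparingInt((String c) -> distance(original, c)));
    }
}
```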
4.6 Result file generation
The tags used when generating the annotation file are presented in Table 4.4.
Table 4.4: Description of the tags used in the annotation result file
Tag | Name | Description
asp | aspect | The aspect (orders or requests, perfective, imperfective)
vox | voice | The voice (active, passive, etc.)
stt | state | The state (indefinite, definite, construct, etc.)
per | person | The person (1st, 2nd, 3rd)
num | number | The number (plural, dual, singular)
gen | gender | The gender (feminine, masculine)
case | case | The case (nominative, accusative, genitive)
suffix | suffix | The suffix of the word
pattern | pattern | The pattern of the word
root | root | The root of the word
stem | stem | The stem of the word
spee | part of speech | The part of speech of the word
prefix | prefix | The prefix of the word
4.7 Conclusion
Several problems complicated the integration of the tools we used. They were mainly due to the different input/output formats of the tools and the different granularities of their tag sets. In addition, other problems appeared while using the analysis tools. For example, the MADA analyzer ignores word characteristics such as the language of the word, which can affect the analysis result. Another problem occurs when the Al-Khalil analyzer returns several analyses, so that the user has to intervene to choose one of them.
Table 4.1: Annotations extracted from the transcription files
Pronunciation notation: Elongation, Liaisons, Elisions, Incomprehensible sequence
Disfluencies: Incomplete word, Onomatopoeia
Named entities: Person name, Place name
Word language: TD, MSA, French, English
  • 59. 44 Chapter 4. Morpho-syntactic Annotation Method Table 4.3: Levenshtein distance table example 1- 0 1 2 3 4 5 6 7 Ð € d ø è © à d d -1 0 1 2 3 4 5 6 7 8 0 Ð 1 0 1 2 3 4 5 6 7 1 € 2 1 0 1 2 3 4 5 6 2 d 3 2 1 1 2 3 4 5 6 3 ø 4 3 2 2 1 2 3 4 5 4 © à 5 4 3 3 2 2 2 3 4 5 d 6 5 4 3 3 3 3 2 3 6 d 7 6 5 4 4 4 4 3 2
Chapter 5
Realization and Performance Evaluation
5.1 Introduction
The expansion of NLP applications for dialectal Arabic requires a large amount of resources in terms of data and tools. By developing a morpho-syntactic annotation tool for the TD language, we facilitate the morpho-syntactic annotation task and thereby support the construction of corpora. In this chapter, we first introduce the tools and resources used in our tool. Then, we present our TD annotation tool by explaining its different modules and functionalities and by providing some details about the development environment. Finally, we experiment with our tool and discuss the results obtained in the different assessments.
5.2 Tools and Resources
The morpho-syntactic annotation process contains several tasks that can be handled using existing tools and resources. In this section, we introduce the analyzers and the dictionary that we used.
5.2.1 Al-Khalil analyzer
The Al-Khalil analyzer was developed to produce tags for a given text by performing a morphological analysis of that text. Its lexical resource consists of several classes that handle vowelled and unvocalized words. The main process is based on patterns for both verbal and nominal words, Arabic word roots, and affixes. According to [Altabba 2010], Al-Khalil is still the best morphological analyzer for Arabic. In addition, Al-Khalil won the first prize in a competition organized by the Arab League Educational, Cultural and Scientific Organization (ALECSO) in 2010.
  • 61. 46 Chapter 5. Realization and Performance Evaluation 5.2.2 MADA analyzer The MADA+TOKAN toolkit is a morphological analyzer that is introduced by [Habash 2009] and used to derive extensive morphological and contextual information from Arabic text. Indeed, the toolkit includes many tasks; high-accuracy part-of-speech tagging, diacritization, lemmatization, disambiguation, stemming, and glossing. In ad- dition, MADA classies the analysis results and gives as an output the more suitable analysis to the current context of each word. The analysis results carry a complete diacritic, lexemic, glossary and morphological information. Also, TOKAN takes the information provided by MADA to generate to- kenized output in a wide variety of customizable formats which allow an easier extraction and manipulation. In addition, MADA achieved an accuracy score of 86% in predicting full diacritization and 96% on basic morphological choice and on lemmatization. 5.2.3 Al-Khalil TD version Recent research in our ANLP group [Amouri 2013] conducted the study of the dialect language by adapting the Al-Khalil analyzer to the TD language. Thanks to the enrich- ment of the transformation rules, the adapted analyzer achieves a score of 81.17% and 96.64% in terms of recall and accuracy for verbs correctly analyzed. 5.2.4 Tunisian Dialect Dictionary The TD dictionary [Ayed 2013, Boujelbane 2013] were constructed using lexical units in the Arabic Tree bank corpus and their parts of speech, to convert words from the MSA to TD language. The obtained results consist on an XML lexical database composed of nine dictionaries (Conjunctions, Pronouns, Number words, Interjections, Particles, Adjectives, Adverbs, Nouns, Verbs). 5.3 Tunisian Dialect Annotation Tool This section is dedicated to present our TD annotation Tool. Indeed, the rst part will clarify the usefulness of our system and its functionality. The second part is divided to clarify details, characteristics, and the development environment.
  • 62. 5.3. Tunisian Dialect Annotation Tool 47 5.3.1 Process of TDAT In order to specify and visualize the artifacts of our system, we will detail its func- tionalities and its manipulation procedure. Also, we will introduce the structure of our system. a- System functionalities The principal functionality of our system is to generate a morpho-syntactitc annotation for each word in the transcription le. Indeed, this process is composed of two basic steps; The rst is to segment the transcription le. The generated segmented le follows a unique XML structure which allows a better representation of the text and speech phenomena. The second step requires as an input a segmented text le, then analyses each word by determining its suitable grammatical class. Our annotation system allows an easier manipulation of the obtained analysis result. Indeed, the user could show, update, save an annotation le, and open an unaccomplished annotated le to complete it. Additionally, the user has the possibility to select through the available option of dictionary and morphological analyzers which will be used in the morpho-syntactitc annotation process. • Segment a Transcription File: The aim of the segmenting script is to prepare the transcription le to the entry of our TDAT. Indeed, this tool will allow our TDAT to support dierent transcription les types. Currently, the developed tool supports two transcripts les formats (trs and TextGrid). The generated le (Segmented Text) follows a unique structure (XML) that allows a better representation of speech phenomena for the user. Indeed, the used structure was created in accordance with the TEI recommendations. Furthermore, the segmented text allows an easier interpretation of the orthographic transcription by our system. Also, the Segmenting tool generates three other les: - Words' list: contains a list of all words and their frequency. - Sentences' list: contains a list of all sentences. - Statistics: contains some useful statistics about the content of the transcription le (See Figure 5.1 for more details).
  • 63. 48 Chapter 5. Realization and Performance Evaluation 1 ?xml version='1.0' encoding='UTF−8'? 2 STATISTIC id=110 3 Speakers3/Speakers 4 Sentence267/Sentence 5 Words3231/Words 6 Hezitation117/Hezitation 7 Onomatopoeia150/Onomatopoeia 8 Elisions102/Elisions 9 NamedEntity54/NamedEntity 10 /STATISTIC Figure 5.1: XML Shema of the Annotated Text In order to segment a transcription le (trs or TextGrid), the user has to open or select a transcription le in the corpus les tree (see (2) in Figure 5.3 for more details). Then, our tool starts the process by opening the transcription le. Indeed, an error message appears when there is a problem while executing this process such as unexpected le format. Along this process, the system informs the user about the progress (See (4) in Figure 5.3). Then, our tool leads the generated segmented text le and asks for showing the obtained results. Finally, our system loads and shows in the statistic menu the statistics information (number of words, sentence, speaker, hesitation, onomatopoeia, elision and named entity) relative to the segmented le (see (2) in Figure 5.4 for more details). The user is noticed if there is a problem while loading the segmented le or the statistic le. • List of Words Frequency: To show the list of words frequency, an already segmented le must be opened or selected in the segmented le list. Indeed, our system leads the list of words frequency le relative to the selected segmented le. If there is a problem while loading the list of words frequency, an error message will appear to inform the user. • Annotation options: To analyse a word, our system uses analyzers and a dictionnary. Those choices could be updated before annotating a le by choosing from the available analyzers and dictionary which one to use. In addition, the path of the used resources could be easily updated (See Figure 5.8). These options take place after saving when a new annotation process is launched.
  • 64. 5.3. Tunisian Dialect Annotation Tool 49 • Segmented File Annotation: The annotation process starts when the user wants to annotate a transcribed le. Indeed, the user has the choice to segment an opened transcription le in the segmented le list or to open an incomplete annotated le. Then, switch the selected resources option (Dictionary and analyzers), the developed system launches the analysis process. During this process, the analysis results appear progressively in the annotation window (see (1) in Figure 5.6 for more details) and another window appears to inform the user about the progress (see (2) in Figure 5.6). In addition, our system gives the user the possibility to show a recap of the analyzed words along the annotation process (see Figure 5.7 for more details). If a problem occurs while analyzing a word or executing one of the used morphological analyzer tool an error message will appear in the console. Also, the user could stop all the current process at any time (see (2) in Figure 5.6). When there is more than one analysis for a given word, case of ambiguity, the user has to intervene to select and to conrm the right analysis for the word (see Figure 5.5). After conrming all the analysis, the user could save the annotation le, otherwise, the system saves the incomplete le by conserving all the analysis for each word that has not been conrmed. To save an annotation le, the user has to select le format and the result directory path. Finally, a message will appear to inform the user that the annotation le has been successfully saved and a new window containing the obtained result appears (See Figure 5.9). Although, a message will appear to inform the user that there is a problem while saving the annotation le. • Update Analysis Results: The user has the possibility to update the analysis results while or after analyzing. To update the analysis results, the user has to select the appropriate grammatical class of a given word from the analysis list returned by our system. In addition, the user could add a new analysis by typing the additional information such as prex and sux (See Figure 5.5). After conrming the new analysis, the system updates the annotation window by adding the new tag in the top of the relative word analysis list. Indeed, the new
added analysis is considered the best analysis, so there is no need to confirm it later.
b- System Collaborations:
Our system collaborates with other tools to determine the most suitable grammatical class for each word. It interacts with:
• the MADA analyzer: to analyze a text;
• the Al-Khalil analyzer: to analyze a word;
• the Al-Khalil TD analyzer: to analyze a word;
• a Perl script: to segment a transcription file.
5.3.2 Realization
We chose the JAVA programming language to develop our system for several reasons:
• JAVA was one of the most popular programming languages in use¹ in 2012, thanks to its simplicity.
• It is platform-independent at both the source and binary levels.
• It allows creating modular programs, which makes it easy to reuse the predefined structure of other projects, in particular the Al-Khalil source code.
Furthermore, we used multi-threading to perform several tasks simultaneously, especially during the annotation process. By using this technique, we first increased the processing speed. Second, we allowed a progressive display of the results, which lets the user intervene to confirm the returned analyses instead of waiting for the end of the whole annotation process.
The choice of the Eclipse development environment allows us to program with several programming languages, in particular PERL and Java. Besides its extensibility in terms of programming languages, this multi-platform environment was already used for the development of the Al-Khalil analyzer. We therefore kept the same project characteristics, such as the text file encoding (Cp1256).
¹ http://en.wikipedia.org/wiki/Java_(programming_language)
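As an illustration of the multi-threaded design described above (a sketch of ours, not the actual TDAT threading code), the annotation loop can run on a background thread while each finished analysis is pushed to the interface through a callback:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Sketch only: run the annotation loop off the UI thread and report each
// analyzed word as soon as it is ready, so the user can confirm analyses
// without waiting for the whole file.
public class BackgroundAnnotator {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private volatile boolean stopped = false;

    public void annotate(List<String> words, Consumer<String> onWordAnalyzed) {
        pool.submit(() -> {
            for (String w : words) {
                if (stopped) break;                 // "stop current process" button
                String analysis = analyze(w);       // dictionary / Al-Khalil / MADA
                onWordAnalyzed.accept(w + " -> " + analysis); // progressive display
            }
        });
    }

    public void stop() { stopped = true; }

    private String analyze(String word) { return "unknown"; } // placeholder
}
```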
In addition, we chose to work in the LINUX environment to take advantage of its speed. This also allows better performance of the MADA analyzer, which is primarily designed for this environment; indeed, the MADA analyzer is entirely built in the PERL programming language.
To manipulate the segmented file and the annotation file, we used the XPath expression language. XPath is based on a tree representation of the XML document and provides the ability to navigate around this tree by selecting nodes with a variety of criteria. XPath was defined by the World Wide Web Consortium (W3C), and its use in our environment requires the JDOM package. Thus, we imported the JDOM library (version 2.0.0) into our project.
Thanks to the use of the Perl programming language, even processing large files takes less than one second. PERL provides many predefined methods that ease the manipulation and generation of text files, and it is considered a leading language in the field of text file processing. Updating the Perl code is also quite easy and avoids updating the whole application. Furthermore, PERL is a multi-platform language and is usually pre-installed on LINUX systems. As we already use the MADA analyzer, relying on this language also benefits our TDAT and does not require any extra dependency. We also added the EPIC plug-in to the Eclipse development environment in order to edit the PERL scripts.
5.3.3 TDAT Interfaces
Main Interface
The main interface is divided into four parts:
Main menu: The purpose of the main menu ((1) in Figure 5.2) is to provide easy access to all the files used by our application. The menu is organized according to the format of the files:
• File: File management.
• Transcription: Management of the transcription files.
• Segmented Text: Management of the segmented text files.
  • 67. 52 Chapter 5. Realization and Performance Evaluation Figure 5.2: Main Interface in the TDAT • Annotation: Management of the annotation les. Speed access menu: The main purpose of the speed access menu (2) in Figure 5.2 is to easily access and use the basic functions of our application that is why we classied these functions in four themes: 1. Corpus 2. Transcription 3. Annotation 4. Statistics Transcription interface The content of a transcription le could be visualized (see (1) in Figure 5.3) through the use of the corpus tree (see (2) in Figure 5.3). Indeed, the corpus tree contains all the transcription les in the corpus.
  • 68. 5.3. Tunisian Dialect Annotation Tool 53 Figure 5.3: Transcription window in the TDAT Segmented text interface To start segmenting (see (3) in Figure 5.3) a transcription le, the user has to choose a le from the corpus tree. The segmentation process is presented as in Figure 5.3. When the process is accomplished a new window that contains the segmented text appear (see (1) in Figure 5.4). In addition, the statistics menu (see (2) in 5.4) shows the content of its relative statistics le. Add analysis interface The user could update analysis results by selecting the right grammatical class for each word (see Figure 5.6). Furthermore, the user could add a new analysis for a given word by choosing the grammatical class and writing its prex and sux in the Add analaysis interface (see Figure 5.5). Analysis interface The analysis in the Figure 5.6 appears progressively. As well, the status icon changes ac- cording to the analyze advancement. Also, another window appears to show the progress of the used options (see (2) in 5.6).
  • 69. 54 Chapter 5. Realization and Performance Evaluation Figure 5.4: Segmented Text window in the TDAT Figure 5.5: Add an analysis result window in the TDAT The Table 5.1 describes each icon signication: Table 5.1: Description of icons used in the annotation interface
  • 70. 5.3. Tunisian Dialect Annotation Tool 55 Figure 5.6: Analyze window in the TDAT Icon Description No possible analysis One analysis Many analysis Add an analysis Conrm an analysis Analysis given by MADA analyzer Analysis given by Al-khalil analyzer Analysis given by TD Dictionary An analysis proposed by the user An Analysis extracted from the transcription
  • 71. 56 Chapter 5. Realization and Performance Evaluation Analysis details interface Figure 5.7: Analyse Details window in the TDAT The statistic button in the annotation interface (see (3) in Figure 5.6) gives the user details about the recognized words (see Figure 5.7). Indeed, these statistics are updated automatically while the analysis process. Annotation options interface Analysis result le interface To save the analysis, the user has to choose the le format and to enter the result le name and path. The generated annotation le is showed in gure 5.9.
Figure 5.8: Annotation options window in the TDAT
5.4 Evaluation
Evaluating a morpho-syntactic annotation system allows us to determine its capabilities and to diagnose its strengths and weaknesses. The evaluation process therefore requires a lot of objectivity. The best-known evaluation method is to compare the performance of the developed system with other similar systems; such systems must have the same input and output. The capability of analyzing a word with its diacritics can also be a decisive factor in the evaluation. Another method used in the state of the art to evaluate a morphological analyzer consists in comparing the analysis results with a gold standard. However, no gold standard morpho-syntactic annotation is available for the TD language yet. Thus, we developed a gold standard to evaluate our tool, as described below. Then, we evaluate three basic modules of the developed system. Finally, we summarize the strengths and weaknesses of the developed TDAT.
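For reference, the three scores used in the following subsections (Accuracy, Precision, F-score1) reduce to simple ratios of the counts defined there. The small helper below is written by us for clarity and is not part of the TDAT source.

```java
// Reference implementation of the evaluation scores used in Section 5.4.2.
public class EvalMetrics {

    // Accuracy = (TP + TN) / (TP + TN + F): share of analyzed words that were recognized.
    static double accuracy(int tp, int tn, int f) {
        return (double) (tp + tn) / (tp + tn + f);
    }

    // Precision = TP / (TP + TN): share of recognized words that were analyzed correctly.
    static double precision(int tp, int tn) {
        return (double) tp / (tp + tn);
    }

    // F-score1 = 2*TP / (2*TP + F), as defined at the end of Section 5.4.2.
    static double fScore1(int tp, int f) {
        return 2.0 * tp / (2.0 * tp + f);
    }

    public static void main(String[] args) {
        // Example with the dictionary-module figures of Table 5.3:
        // 77 correctly analyzed vs. 4 incorrectly analyzed -> 95.06% precision.
        System.out.printf("Precision: %.2f%%%n", 100 * precision(77, 4));
    }
}
```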
Figure 5.9: Analysis Result File window in the TDAT
5.4.1 Gold standard for the TD language
In order to evaluate our system, we developed a gold standard for the TD language composed of two annotated transcription files (2,409 words). These morpho-syntactic annotations were created manually by an expert in linguistics. The annotation tags used are the same as in the TD dictionary (Conjunction, Pronoun, Number word, Interjection, Particle, Adjective, Adverb, Noun, Verb). We also included the suffix and the prefix as additional information.
5.4.2 Evaluate the TDAT
We chose to evaluate three modules of our system:
Evaluate the word segmenter: One of the basic tasks when analyzing a word with the dictionary is to determine its prefix and suffix. Indeed, words may include a suffix and a prefix, so the system has to decompose them using our word segmenter module. To evaluate this module, we first analyze the words without segmenting them; then we use our segmenter to identify the possible prefix and suffix of each word not recognized by our system. Example:
The following word e êË e © t( ‚ Ó could not be recognized when analyzed with the dictionary. However, by using our word segmenter module within the dictionary analyzer, our system identifies its suffix (e êË) and annotates the word as a verb. The evaluation results in terms of Accuracy are detailed in Table 5.2. Accuracy is defined as:
Accuracy = (TP + TN) / (TP + TN + F) = T / AW
• TP: words recognized and correctly analyzed
• TN: words recognized but not correctly analyzed
• F: words not recognized
• T: all recognized words
• AW: all analyzed words
Table 5.2: Evaluation results of the word segmenter module
 | Recognized | Not recognized | Accuracy
Before using the segmenter module | 239 | 826 | 22.44%
After using the segmenter module | 410 | 655 | 38.49%
By integrating the segmenter module, we thus achieved an improvement of 16.05% in terms of Accuracy when using the dictionary module (after segmentation, 410 of the 1,065 analyzed words are recognized).
Evaluate the Levenshtein distance: Analyzing a word with the TD dictionary can produce several analyses, which is primarily due to differences in diacritics writing. After testing our system on a transcription text composed of 1,065 words, we found that 47% (81 cases) of the analyses obtained from the dictionary module are ambiguous. Indeed, multiple choices appear for these words, owing to the use of the segmenter module. To rank these analyses, we used the Levenshtein distance.
Example:
Word: e ëñt ¢ ª © u
Possible analyses without applying the Levenshtein distance: ù ¢ ª © u, ñt ¢ ª © u
Possible analyses after applying the Levenshtein distance: ñt ¢ ª © u, ù ¢ ª © u
In this example, the system places in first position the word ñt ¢ ª © u, which has the shortest distance to the original word e ëñt ¢ ª © u.
In order to evaluate the usefulness of the Levenshtein distance function, we evaluate the Precision score of the TD dictionary module. Precision is defined as:
Precision = TruePositive / TestOutcomePositive = CorrectlyAnalyzedWords / AllRecognizedWords
Since our system considers the first analysis in the list as the best analysis for a word, ordering the results with the Levenshtein distance improves the system's performance. Table 5.3 gives the evaluation results of the TD dictionary module before and after applying the Levenshtein distance procedure.
Table 5.3: Evaluation results of the TD dictionary module using the Levenshtein distance function
 | Correctly analyzed | Not correctly analyzed | Precision
Before using the Levenshtein distance | 52 | 29 | 64.19%
After using the Levenshtein distance | 77 | 4 | 95.06%
We notice that in some cases ordering the analyses according to suffix and diacritics distance does not solve the ambiguity problem. The integration of a classification module based on the sentence context of the word could solve it.
Evaluate the analysis results: The main function of our tool is to deliver as output a morpho-syntactic annotation for each word of the input. In order to evaluate this module, we used our gold standard as a test corpus.
The evaluation results of the TDAT when analyzing with all resource options are detailed in Table 5.4.
Table 5.4: Evaluation results of the TDAT
 | Recognized | Not recognized
Correct analysis | 1803 | 251
Incorrect analysis | 355 | —
We obtain a Precision score of 83.54%. The incorrect analyses are almost all Adjectives interpreted as Verbs. This problem was detected when analyzing complex TD words with the Al-Khalil analyzer: the patterns used give an incorrect interpretation when the word suffixes are removed. When using the MADA analyzer, some words are incorrectly tagged as Noun, which is caused by the difference in meaning of these words between the TD and MSA languages. Analysis errors when using the TD dictionary module are due to differences in diacritics; this problem is caused by variation in the way TD is spoken. We used the F-measure to study the quality of the analysis results. The F-measure is defined as:
F-score1 = 2*TP / (2*TP + F)
• TP: correctly analyzed words
• F: not recognized words
We obtained an F-score1 of 91.03%, which is a promising result compared to the existing tools for the TD language.
5.5 Conclusion
In this chapter we presented the TDAT, which was developed to handle the morpho-syntactic annotation task. In order to allow easier use and extension of our project,
we used free, multi-platform software. In addition, the TDAT can use different resource options, which leads to a detailed analysis. Despite requiring more analysis time than the other tools we used, the MADA analyzer remains very useful for transcripts that contain formal discussions, such as TV dialogues. The results obtained when using all resource options are very promising: we achieved an F-score1 of 91.03% on our test corpus. The developed tool could be further improved by ranking the analysis results. Also, enriching the TD dictionary, especially the Noun dictionary, could lead to better results.
Conclusion
The tools for dialectal Arabic are few and often lack certain features or do not reach the same standard as their MSA counterparts. There is therefore a need for resources and tools for the Arabic dialects in order to start creating new and better NLP applications. By developing the TDAT, we aimed to provide a tool that accepts different transcription formats and produces morpho-syntactic annotations for TD words.
In order to build a morpho-syntactically annotated corpus for the TD language, we started by collecting speech data. Then, we transcribed the collected data following our orthographic transcription guidelines, using two transcription tools (Transcriber and Praat). Finally, we developed a tool that accepts the elaborated transcription files as input and produces a morpho-syntactically annotated file as output. To handle this task, our tool uses a TD dictionary and two analyzers (the Al-Khalil TD analyzer and the MADA analyzer). In addition, the resource options used can be easily updated.
During the transcription process we created a corpus of more than 1 hour and 25 minutes. A portion of the developed corpus was used to train the developed system. In order to determine the capability of our TDAT tool to analyze TD text, we constructed a test gold standard for the TD language and used a portion of this corpus to test the different modules of our tool. The evaluation results show that the segmenter module achieves a score of 95.06% in terms of precision. However, regarding the obtained Accuracy score, analysis with the dictionary still needs improvement, in particular by enriching the dictionary, especially the Noun dictionary. Thanks to the use of Al-Khalil TD and the other resource options, our tool attains an F-score1 of 91.03%.
The developed corpus could be enlarged by integrating other topics. Furthermore, our corpus covers different subjects and can be used for learning linguistic analysis models, for automatic speech processing, or in any other area of natural language processing. The analysis results obtained by our tool could be improved by enlarging the TD lexical database, by using a classification module (based on statistics, for example) to rank the analysis results, and by updating the Al-Khalil patterns and database through the study of the new ambiguous cases encountered during analysis. The input of our system could be
modified to support other speech-text formats, such as web pages, since the use of dialectal language on social networks is constantly increasing.