Discourse annotation for Arabic 2

Presentation Transcript

  • Survey on Discourse Annotation for Arabic
    A. Algarni, H. Alharbi and N. Almutairy
    Supervisor: Dr. A. Alsaif
    April 23, 2013
    Kingdom of Saudi Arabia, Ministry of Higher Education
    Imam Mohammed Ibn Saud Islamic University
    College of Computer and Information Sciences
    CS465 - Natural Language Processing
  • Outline Introduction The Leeds Arabic Discourse Treebank Discourse Connective Recognition Discourse Relation Recognition Semantic-Based Segmentation Discourse Segmentation Based on RhetoricalMethods A Comprehensive Taxonomy of Arabic DiscourseCoherence Relations2
  • Introduction Linguistic annotation covers any descriptiveor analytic notations applied to raw languagedata. Annotated Discourse Corpora can be veryuseful to facilitate theoretical studies alongwith contributing in the development of NLPapplications.3
  • Applications Information extraction Question-answering Summarization Machine translation, generation.4
  • Discourse Relations and Discourse Connectives
      • Discourse relation: the way in which two arguments (text segments) are logically connected, e.g. Temporal, Comparison, Causal, Expansion.
      • Discourse connective (DC): a lexical marker used to link two abstract objects in a text.
      • Abstract object (AO): abstract objects in discourse are things like propositions, events, facts and opinions.
      • Argument (Arg): a text span expressing an abstract object and linked by a DC.
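The definitions above suggest a simple record for one annotated relation. The following is a minimal sketch only: the class name `DiscourseRelation`, its field names, and the `render` helper are hypothetical, and labelling "because" as Causal is an illustrative assumption rather than an annotation taken from the treebank.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DiscourseRelation:
    """One explicit relation: a connective linking two arguments (hypothetical layout)."""
    arg1: str                 # text span expressing the first abstract object
    connective: str           # the discourse connective (DC)
    arg2: str                 # text span expressing the second abstract object
    senses: Tuple[str, ...]   # one or more relation labels


def render(rel: DiscourseRelation) -> str:
    """Format a relation in the bracketed Arg1/DC/Arg2 notation."""
    return f"[{rel.arg1}]Arg1 [{rel.connective}]DC [{rel.arg2}]Arg2"


# Illustrative English example; the sense label is an assumption.
example = DiscourseRelation(
    arg1="I am very happy",
    connective="because",
    arg2="I got excellent marks in exams",
    senses=("Causal",),
)
print(render(example), "->", ", ".join(example.senses))
```

Storing `senses` as a tuple leaves room for a connective that carries more than one relation label.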
  • The Leeds Arabic Discourse Treebank
      • The first effort towards producing an Arabic discourse treebank was introduced in 2011 by A. Alsaif and K. Markert.
      • They collected a large set of Arabic discourse connectives using text analysis and corpus-based techniques.
      • The final list contains 107 discourse connectives.
  • Types of Discourse Connectives
  • Types of Relations
  • Types of Relations (cont.): COMPARISON.Similarity
  • Arabic Discourse Annotation Tool (ADA) and Annotation Process
  • Annotation Methodology
      1. Measuring whether annotators agree on the binary decision of whether an item constitutes a discourse connective in context.
      2. Measuring whether annotators agree on which discourse relation an identified connective expresses. Annotators can assign a set of relations to a connective.
  • Results Agreement in task 1 is highly reliable(N=23331) percentage agreement of0.95, kappa of 0.88. Agreement in task 2 (relation assignment)is relatively low (N=5586), percentageagreement of 0.66, kappa 0.57, and alphaof 0.58.12
  • Discourse Connective Recognition To distinguish between discourse and non-discourse usage of a connective. Example: once, while. A. Alsaif and K.Markert (2011) introduceda Connective identifier for Arabic based onsyntactic features.13
  • Discourse Connective Recognition by A. Alsaif and K. Markert (2011)
    Features:
      • Surface features (SConn)
      • Lexical features of surrounding words (Lex)
    Example (Arg1 DC Arg2): [Children might be tired]Arg1 [and]DC [feel sleepy]Arg2 during school time if they did not sleep well.
  • Features: Part of Speech features (POS) Syntactic category of related phrases(Syn) (E.g.: / the school isvery large and beautiful) Al-Masdar feature.Discourse Connective Recognitionby A. Alsaif and K.Markert (2011) Cont…15
  • Discourse Connective Recognition by A. Alsaif and K. Markert (2011) (cont.): Results

    Model  Features                         Accuracy  Kappa
    -      Baseline (not Conn)              68.9      0
    M1     Conn only                        75.7      0.48
    Tokenization by white space + auto tagger
    M2     Conn+SConn+Lex                   85.6      0.62
    M3     Conn+SConn+Lex+POS               87.6      0.69
    M4     Conn+SConn+Lex+POS+Masdar        88.5      0.70
    ATB-based features
    M5     Conn+SConn+Lex                   86.2      0.65
    M6     Conn+SConn+Lex+Syn/POS           91.2      0.79
    M7     Conn+SConn+Lex+Syn/POS+Masdar    92.4      0.82
    M8     Conn+SConn+Syn                   91.2      0.79
    M9     SConn+Lex+Syn+Masdar             91.2      0.79
  • Discourse Relation Recognition To identify the type of the relation A. Alsaif and K.Markert (2011) introducedthe first algorithms to automaticallyidentify relations for Arabic17
  • Features: Connective features Words and POS of arguments Masdar Tense and Negation Length, Distance and Order Features Argument Parent Production RulesDiscourse Relation Recognitionby A. Alsaif and K.Markert (2011)18
  • Results

    Model  Features                                      Accuracy  Kappa
    All connectives (6039)
    -      Baseline (CONJUNCTION)                        52.5      0
    M1     Conn only (1)                                 77.2      0.60
    M2     Conn+Conn f+Arg f (37)                        78.7      0.66
    M3     Conn+Conn f+Arg f+Production rules (1237)     78.3      0.65
    Excluding wa at BOP (3813)
    -      Baseline (CONJUNCTION)                        35        0
    M1     Conn only (1)                                 74.3      0.65
    M2     Conn+Conn f+Arg f (37)                        77.0      0.69
    M3     Conn+Conn f+Arg f+Production rules (1237)     76.7      0.69
  • Results

    Model  Features                   Accuracy  Kappa
    All connectives (6039)
    -      Baseline (EXPANSION)       62.4      0
    M1     Conn only (1)              88.7      0.78
    M2     Conn+Conn f+Arg f (37)     88.7      0.78
    Excluding wa at BOP (3813)
    -      Baseline (EXPANSION)       41.8      0
    M1     Conn only (1)              82.7      0.74
    M2     Conn+Conn f+Arg f (37)     83.5      0.75
  • Semantic-Based Segmentation of Arabic Texts
      • Corpus analysis
      • Definition: Let L be a list of candidate segment connectors; each element c in L is classified, based on its effect on the text segmentation, as either active or passive.
      • Examples: [two bracketed Arabic examples, not preserved in the transcript]
  • Segmentation Process Identifying the connectors that indicatecomplete segments. Locating the active connectors. Resolving the case where adjacent activeconnectors exist. Setting the segments boundaries. Creating the final list of segments.22
  • Discussion evaluate the segmentation process, theycollected ten essays. Each essay ranges between 500 and 700words. After implementing the segmentationprocess. Gave the output to judges to evaluatethem in terms of two factors: correcthit and incorrect hit.23
  • Discussion (cont.)

    Essay  Correct hit  Incorrect hit
    1      33           0
    2      15           1
    3      25           0
    4      23           1
    5      20           0
    6      29           1
    7      26           1
    8      33           2
    9      26           0
    10     22           0
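One way to summarise the per-essay judgements is the overall share of correct hits. The sketch below assumes the flattened table on this slide reads, per essay, as one incorrect-hit digit, a two-digit correct-hit count, and the essay number; the slides themselves report no aggregate figure, so the total is derived here, not quoted.

```python
# essay number -> (correct hits, incorrect hits), as read from the slide's table
hits = {
    1: (33, 0), 2: (15, 1), 3: (25, 0), 4: (23, 1), 5: (20, 0),
    6: (29, 1), 7: (26, 1), 8: (33, 2), 9: (26, 0), 10: (22, 0),
}

total_correct = sum(correct for correct, _ in hits.values())
total_incorrect = sum(incorrect for _, incorrect in hits.values())
# Share of all judged boundaries that were marked as correct hits.
correct_rate = total_correct / (total_correct + total_incorrect)

print(total_correct, total_incorrect, f"{correct_rate:.1%}")  # 252 6 97.7%
```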
  • Arabic Discourse Segmentation Based on Rhetorical Methods
      • This method depends on the meaning of the connector " " in the Arabic language.
      • There are six types of " ", classified into two classes, "Fasl" and "Wasl":
          • "Fasl": a segmenting place.
          • "Wasl": does not segment, but connects the text.
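  • Types of Connector " "
    Table with columns Type, Example and Class: three types are classed as Fasl and three as Wasl. The type names and Arabic examples are not preserved in the transcript.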
  • The Arabic Sentence Segmentation System
  • Feature Extraction
      • The following are the features of " ": X3 = noun and X7 = accusative mark.
  • Experiment and Results They used 1200 instances for training. They used 293 instances for testing aftertesting there are 290 correct and 3incorrect instances. The result with:94.68%Recall96.82%Precision98.98 %Accuracy29
  • A Comprehensive Taxonomy of Arabic Discourse Coherence Relations
      • Coherence relations are classified into two types: explicit relations and implicit relations.

        Coherence relations   Example
        Explicit relations    I am very happy because I got excellent marks in exams.
        Implicit relations    I am very happy. I got excellent marks in exams.
  • The Procedure of Creating an Arabic Taxonomy of Coherence Relations
  • Examples of Implicit Arabic Relations
      • "Impossible condition"
      • "Cascaded questioning"
  • Results They got a set of 47 Arabic coherencerelations.coherence relations.ResultFrom English coherencerelations.31additional Arabic explicitcoherence relations.12Arabic implicit relations.433
  • Conclusion
    Discourse annotation is a very fertile field with many NLP applications. For Arabic, there are some challenges due to the lack of annotated corpora and studies.
  • Thank You