Reference Scope Identification in Citing Sentences

Authors: Amjad Abu-Jbara, Dragomir Radev (University of Michigan)
Conference: NAACL 2012
Expositor: Akihiro Kameda (Aizawa Lab., The University of Tokyo)
Abstract
●   Problem:
    ●   Multiple citations in one sentence, e.g.:
    ●   "There are many POS taggers developed using
        different techniques for many major languages such
        as transformation-based error-driven learning (Brill,
        1995), decision trees (Black et al., 1992), Markov
        model (Cutting et al., 1992), maximum entropy
        methods (Ratnaparkhi, 1996) etc. for English."
●   Approach: Preprocessing
        and 2 + 1 + 2×3 + 1 = 10 methods
        (2 word classifiers + 1 word-level CRF
        + 2 segmentations × 3 labeling rules + 1 baseline)
Preprocessing & Methods
Reference Preprocessing
    (tagging, grouping, non-syntactic element removal;
    a code sketch follows the examples below)
●   These constraints can be lexicalized (REF.1; REF.2),
    unlexicalized (REF.3; TREF.4) or automatically learned
    (REF.5; REF.6).

●   These constraints can be lexicalized (GREF.1), unlexicalized
    (GTREF.2) or automatically learned (GREF.3).

●   (GTREF.1) apply fuzzy techniques for integrating source
    syntax into hierarchical phrase-based systems (REF.2).
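A minimal sketch of how this preprocessing could look, assuming citations of the form "(Author, Year)" and an illustrative regex of my own; the paper does not give code, and the non-syntactic element removal step is omitted here:

```python
import itertools
import re

# Sketch of the tagging + grouping preprocessing, assuming citations look
# like "(Author, 1995)" or "(A, 1995; B, 1996)". The regex is an
# illustrative assumption, not the authors' implementation.
CITATION = re.compile(r"\(([^()]*?\b\d{4}[a-z]?(?:\s*;[^()]*?\b\d{4}[a-z]?)*)\)")

def tag_and_group(sentence: str) -> str:
    ref, gref = itertools.count(1), itertools.count(1)
    def repl(match):
        if ";" in match.group(1):           # several citations inside one
            return f"(GREF.{next(gref)})"   # parenthesis -> one group tag
        return f"(REF.{next(ref)})"         # a lone citation keeps its own tag
    return CITATION.sub(repl, sentence)

print(tag_and_group(
    "These constraints can be lexicalized (Smith, 2000; Doe, 2001), "
    "unlexicalized (Roe, 2002) or automatically learned (Poe, 2003)."))
# -> "... lexicalized (GREF.1), unlexicalized (REF.1) or ... (REF.2)."
```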
Approach 1 (SVM, LR)
●   Word classification
    ●   with an SVM and a logistic regression classifier
        (see the sketch after the example below)
●   Features: Distance, Position (Before/After), Same Segment
    (delimiters: , . ; and, but, for, nor, or, so, yet), POS tag,
    Dependency Distance, Dependency Relations, Common
    Ancestor Node, Syntactic Distance
●   Problem Example:
    ●   "There are many POS taggers developed using different
        techniques for many major languages such as
        transformation-based error-driven learning (Brill, 1995),
        decision trees (Black et al., 1992), Markov model (Cutting
        et al., 1992), maximum entropy methods (Ratnaparkhi,
        1996) etc. for English."
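A minimal sketch of this word-classification setup with scikit-learn, on invented toy data and only two of the listed features (distance and before/after position); the paper's actual implementation used LibSVM and Weka:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Sketch of Approach 1: classify each word as inside/outside the scope of
# the target reference (TREF). Toy labels, abbreviated feature set.
def word_features(tokens, target_idx, i):
    return {
        "distance": abs(i - target_idx),       # token distance to TREF
        "before_target": int(i < target_idx),  # position relative to TREF
    }

tokens = "decision trees TREF , Markov model REF".split()
target = tokens.index("TREF")
X = [word_features(tokens, target, i) for i in range(len(tokens))]
y = [1, 1, 1, 0, 0, 0, 0]   # 1 = inside scope, 0 = outside (toy annotation)

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X), y)
print(clf.predict(vec.transform(X)))  # an SVM (e.g. LinearSVC) drops in here
```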
Approach 2 (CRF)
●   Sequence labeling with CRF (sketch below)
    ●   features are the same as in Approach 1
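A corresponding sketch with the sklearn-crfsuite package (the paper used MALLET); same toy sentence as above, with the whole label sequence predicted jointly:

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

# Sketch of Approach 2: the same per-word features as in Approach 1, but
# the sentence is labeled jointly as a sequence by a linear-chain CRF.
def sentence_features(tokens, target_idx):
    return [{"distance": float(abs(i - target_idx)),
             "before_target": float(i < target_idx)}
            for i in range(len(tokens))]

tokens = "decision trees TREF , Markov model REF".split()
X = [sentence_features(tokens, tokens.index("TREF"))]
y = [["I", "I", "I", "O", "O", "O", "O"]]  # I = inside scope, O = outside

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```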
Approach 3-S1-* (CRF/segment)
●   segmentation (1)
    ●   punctuation marks
    ●   coordinating conjunctions
        –   and, but, for, nor, or, so, yet
    ●   a set of special expressions
        –   "for example", "for instance", "including", "includes",
            "such as", "like", etc.
●   [Rerankers have been successfully applied to numerous
    NLP tasks such as] [parse selection (GTREF)], [parse
    reranking (GREF)], [question-answering (REF)].
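A sketch of this splitting rule using a simple regex, with the delimiter list taken from the slide; longer cue phrases come first in the pattern so that, e.g., "for example" wins over the bare conjunction "for" (exact tokenization details are assumptions):

```python
import re

# Sketch of segmentation S1: split at punctuation, coordinating
# conjunctions, and the cue expressions listed on the slide.
DELIMS = re.compile(
    r"\bfor example\b|\bfor instance\b|\bincluding\b|\bincludes\b"
    r"|\bsuch as\b|\blike\b"
    r"|\band\b|\bbut\b|\bfor\b|\bnor\b|\bor\b|\bso\b|\byet\b"
    r"|[,.;]")

def segment(sentence: str):
    return [part.strip() for part in DELIMS.split(sentence) if part.strip()]

print(segment("Rerankers have been successfully applied to numerous NLP "
              "tasks such as parse selection (GTREF), parse reranking "
              "(GREF), question-answering (REF)."))
# -> ['Rerankers have been successfully applied to numerous NLP tasks',
#     'parse selection (GTREF)', 'parse reranking (GREF)',
#     'question-answering (REF)']
```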
Approach 3-S2-* (CRF/segment)
●   segmentation (2)
    ●   chunking tool
        –   noun groups
        –   verb groups
        –   preposition groups
        –   adjective groups
        –   adverb groups
        –   other parts form segments by themselves
●   [To] [score] [the output] [of] [the coreference models], [we]
    [employ] [the commonly-used MUC scoring program (REF)]
    [and] [the recently-developed CEAF scoring program (TREF)].
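A sketch of chunk-based segmentation that stands in for a real chunker (the paper uses LT-TTT): consecutive tokens whose POS tags map to the same group form one segment, and everything else is a segment by itself. The tag-to-group mapping below is a simplification of real chunker output:

```python
# Simplified POS-tag-to-group mapping (assumption, not LT-TTT's).
GROUP = {"NN": "noun", "NNS": "noun", "NNP": "noun", "DT": "noun",
         "VB": "verb", "VBD": "verb", "VBP": "verb", "VBZ": "verb",
         "IN": "prep", "TO": "prep",
         "JJ": "adj", "RB": "adv"}

def chunk_segments(tagged):
    segments, current, kind = [], [], None
    for word, pos in tagged:
        group = GROUP.get(pos)              # None -> token stands alone
        if group is not None and group == kind:
            current.append(word)            # extend the open group
        else:
            if current:
                segments.append(current)
            current, kind = [word], group
    if current:
        segments.append(current)
    return segments

tagged = [("we", "PRP"), ("employ", "VBP"), ("the", "DT"),
          ("MUC", "NNP"), ("program", "NN"), ("(REF)", "NN")]
print(chunk_segments(tagged))
# -> [['we'], ['employ'], ['the', 'MUC', 'program', '(REF)']]
```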
Approach 3-*-R1,2,3 (CRF/segment)
●   R1: majority label of the words it contains
●   R2: inside if any word is inside
●   R3: outside if any word is outside
    ●   [I O O O O] [I I I] [O O]
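A sketch of the three consolidation rules, applied to the bracketed example above:

```python
# Map per-word I/O labels to one label per segment under rules R1-R3.
def consolidate(word_labels, seg_sizes, rule):
    out, i = [], 0
    for size in seg_sizes:
        seg = word_labels[i:i + size]
        i += size
        if rule == "R1":                    # majority label in the segment
            out.append(max(set(seg), key=seg.count))
        elif rule == "R2":                  # inside if ANY word is inside
            out.append("I" if "I" in seg else "O")
        else:                               # R3: outside if ANY word is outside
            out.append("O" if "O" in seg else "I")
    return out

labels = list("IOOOO" "III" "OO")           # the [I O O O O] [I I I] [O O] example
for rule in ("R1", "R2", "R3"):
    print(rule, consolidate(labels, [5, 3, 2], rule))
# R1 ['O', 'I', 'O']   R2 ['I', 'I', 'O']   R3 ['O', 'I', 'O']
```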
AR2011 (baseline)
●   the approach of Abu-Jbara and Radev (2011), which uses
    the link grammar parser (Sleator and Temperley, 1991)
Experiment
Data
●   ACL Anthology Network Corpus
●   3,300 sentences, each containing ≥ 2 citations

Annotation agreement
●   500 of the 3,300 sentences, annotated by two annotators
    ●   Agreement on preprocessing is perfect
    ●   Kappa coefficient of scope:
        K = (P(A) − P(E)) / (1 − P(E)) = 2·P(A) − 1 = 0.61
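A worked check of this value: with two annotators and binary inside/outside labels, chance agreement P(E) = 1/2 (per the editor's note below), so K = 0.61 corresponds to an observed agreement P(A) of about 0.805:

```python
# Cohen's kappa; with p_e = 1/2 it simplifies to 2*p_a - 1.
def kappa(p_a: float, p_e: float = 0.5) -> float:
    return (p_a - p_e) / (1 - p_e)

print(round(kappa(0.805), 2))   # -> 0.61
```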
Tools
●   Edinburgh Language Technology Text
    Tokenization Toolkit (LT-TTT)
    ●   text tokenization, part-of-speech tagging, chunking,
        and noun phrase head identification.
●   Stanford parser
    ●   syntactic and dependency parsing
●   LibSVM with linear kernel
●   Weka
    ●   logistic regression classification
Tools
●   Machine Learning for Language Toolkit
    (MALLET)
    ●   CRF

Validation
●   10-fold cross validation
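A minimal sketch of the evaluation protocol using scikit-learn's cross_val_score on random stand-in data (not the paper's features or corpus):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 10-fold cross validation over toy word instances; random features stand
# in for the real feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # 200 toy instances, 5 features
y = (X[:, 0] > 0).astype(int)        # toy inside/outside labels
scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print(scores.mean())
```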
Experiment (Preprocessing)
●   These constraints can be lexicalized (REF.1; REF.2),
    unlexicalized (REF.3; TREF.4) or automatically learned
    (REF.5; REF.6).
    →   Tagging: 98.3% precision and 93.1% recall
●   These constraints can be lexicalized (GREF.1), unlexicalized
    (GTREF.2) or automatically learned (GREF.3).
    →   Grouping: Perfect!
●   (GTREF.1) apply fuzzy techniques for integrating source
    syntax into hierarchical phrase-based systems (REF.2).
    →   Non-syntactic reference removal: 90.08% precision and
        90.1% recall
Experiment (Main)
[Results chart; best-performing combination:]
●   CRF
●   Chunking
●   Majority
Feature Analysis
●   Features: Distance, Position (Before/After), Same
    Segment (delimiters: , . ; and, but, for, nor, or, so, yet),
    POS tag, Dependency Distance, Dependency Relations,
    Common Ancestor Node, Syntactic Distance
Summary
●   Identified reference scopes in sentences that contain
    multiple citations
●   Best method: CRF with chunking-based segmentation and
    majority labeling
Editor's Notes

  • #3: Related work in the paper shows that Prof. Nanba and the authors themselves have addressed scope identification for cases where a citation is explained across multiple sentences. Applications include summarization and the like.
  • #13: With two annotators, the probability of chance agreement P(E) is 1/2; P(A) is a little over 80%.