An Alignment-based Pattern Representation Model
                                              for Information Extraction
                                                                         Seokhwan Kim, Minwoo Jeong, Gary Geunbae Lee
                                                                             {megaup, stardust, gblee}@postech.ac.kr
 Abstract - In this paper, we propose an alternative pattern representation model and the effective method of utilizing it. While the previous pattern
 representation models completely depend on the result of dependency analysis, our approach is basically based on the lexical alignment and
 considers the result of dependency analysis only as a meaningful feature of the alignment process. In this way, we can cope with the errors of
 incomplete dependency analysis. An evaluation of a scenario template task shows that our proposed model outperforms the previous
 syntax-dependent models.


                                                                  Pattern Representation Model for Information Extraction
 Information Extraction                                                               Pattern Representation Model for IE                                        Related Works
 Extracting the defined number of relevant                                            Problem Definition                                                          Lexical Sequence Pattern Models
arguments from natural language documents                                                    Ex)                                                                     A set of lexical sequences
 Subtasks                                                                                                                                                                   <hum_tgt> be kidnapped
                                                                                           About 50 peasants have been kidnapped by terrorists of the FMNL
                                                                                                                                                                             be kidnapped by <prep_ind>
 # of arguments                         subtask
        1                      named-entity recognition                                                                             Extraction Pattern                Syntax-dependent Pattern Models
        2                      binary relation extraction                                                                                   ?                           A set of subtrees (from D-tree)
   more than 2                 relation/event extraction
                                                                                                             incident type   kidnapping                                                      kidnapped
 Approach                                                                                                   prep_ind        terrorists                                         nsubjpass                        agent


   Automatic Pattern Learning                                                                               prep_org        FMNL                                               peasants                  terrorists
                                                                                                             hum_tgt         peasants
     Pattern Representation Model                                                                                                                                                                                       prep_of


     Pattern Learning Algorithm                                                                                                                                    (kidnapped ({HUM_TGT}-nsubjpass)    FMNL
                                                                                                                                                                    (kidnapped ({PREP_IND}-agent))
                                                                                                                                                                    (kidnapped ({PREP_IND}-agent ({PREP_ORG}-prep_of)))

                                                                                                                    Method
 Our Approach                                                                                        Pattern Sequences Extraction
  Pattern Model                                                                                     1) Searching the sentences containing all                    Ex)                      (3+1)/(0+1) = 4
    Lexical Sequence Pattern                                                                        arguments of each tuple in source documents
                                                                                                     2) Segmenting out subpart of the sentence                                    kidnapped
    + Term Weight (from Dependency Analysis)
                                                                                                     based on clausal boundaries                                nsubjpass                                agent
      <HUM_TGT>         of      [NP]     have    been      kidnapped   by     <PRED_IND>
                                                                                                     3) Replacing the parts of arguments in the
                                                                                                                                                                                (1+1)/(1+1) = 1                    (2+1)/(1+1) = 1.5
           1           0.33     0.33      4          4        4        1.5           1.5
                                                                                                     sub-sentence with argument labels
                                                                                                      Computing Term Weights                                  <HUM_TGT>                       <PREP_IND>
   Soft Pattern Matching
     Sequence Alignment                                                                                  wi = (ri + c) / (di + c)                              prep_of                                       prep_of

     about 50 peasants          have          been       kidnapped     by         terrorists              wi : weight of i-th term
                                                                                                          ri : number of relevant terms within                      [NP]                         <PREP_ORG>
                                                                                                              a subtree, ti as root
  <HUM_TGT>       of         [NP]      have      been      kidnapped         by     <PREP_IND>            di : distance from root node                     (0+1)/(2+1) = 0.33                            (1+1)/(2+1) =0.67
                                                                                                          c : for smoothing (default:1)
                                                                                                                                                         Experiment
 Pattern Matching
  Sequence Alignment                                                                                Experimental Setup                            Experimental Result
  Based on a Dynamic Programming                                                                     Data                                          Pattern Models
  Alignment Matrix                                                                                     MUC-3/4 Data                                  SVO Model (Yangarber ‘00)
               peasants have been kidnapped by terrorists                                                 About the Terrorism Events                  Linked-Chain Model (Greenwood ‘06)
   <HUM_TGT>      1       0   0        0     0     0
         of       0       1   0        0     0     0
                                                                                                        Simpler template structure with 4 slots       Subtree Model (Sudo ‘03)
        [NP]      0     0.66 1         0     0     0                                                      perp_ind, perp_org, phys_tgt, hum_tgt       Our Model
        have      0       4   3        2     1     0
        been      0       0   8        7     6     5                                                    Dev-set (training), Test-set (evaluation)   Result
     kidnapped    0       0   4       12     11    10                                                 Preprocessing                                      Model      Precision Recall F-measure
         by       0       0   0        8    13.5  12.5
   <PRED_IND>    1.5     0.5  0        0     0     15                                                   Dependency Parsing and NP-chunking                SVO        21.74    20.62    21.16

                                                                                                          Stanford Parser                             Linked-Chain   20.04    26.55    22.84
   Matrix Computation                                                                                Extracting Pattern Candidates                      Subtree     23.34    32.73    27.25
                                                                                                                                                        Alignment     23.35    45.62    30.89
                                       Mi-1,j-1 + sim i-1,j-1 * wi-1                                    Selecting all pattern candidates for test
                                       Mi-1,j + gp * wi-1                                               Without pattern filtering                     Our proposed model achieved much
          Mi,j = max
                                       Mi,j-1 + gp * wi                                                 To compare not the pattern filtering         higher recall than the other models with
                                       0                                                                  method, but the representative performance                      similar precision
                                                                                                          among pattern models

An Alignment-based Pattern Representation Model for Information Extraction

  • 1.
    An Alignment-based PatternRepresentation Model for Information Extraction Seokhwan Kim, Minwoo Jeong, Gary Geunbae Lee {megaup, stardust, gblee}@postech.ac.kr Abstract - In this paper, we propose an alternative pattern representation model and the effective method of utilizing it. While the previous pattern representation models completely depend on the result of dependency analysis, our approach is basically based on the lexical alignment and considers the result of dependency analysis only as a meaningful feature of the alignment process. In this way, we can cope with the errors of incomplete dependency analysis. An evaluation of a scenario template task shows that our proposed model outperforms the previous syntax-dependent models. Pattern Representation Model for Information Extraction  Information Extraction  Pattern Representation Model for IE  Related Works  Extracting the defined number of relevant  Problem Definition  Lexical Sequence Pattern Models arguments from natural language documents  Ex)  A set of lexical sequences  Subtasks <hum_tgt> be kidnapped About 50 peasants have been kidnapped by terrorists of the FMNL be kidnapped by <prep_ind> # of arguments subtask 1 named-entity recognition Extraction Pattern  Syntax-dependent Pattern Models 2 binary relation extraction ?  A set of subtrees (from D-tree) more than 2 relation/event extraction incident type kidnapping kidnapped  Approach prep_ind terrorists nsubjpass agent  Automatic Pattern Learning prep_org FMNL peasants terrorists hum_tgt peasants  Pattern Representation Model prep_of  Pattern Learning Algorithm (kidnapped ({HUM_TGT}-nsubjpass) FMNL (kidnapped ({PREP_IND}-agent)) (kidnapped ({PREP_IND}-agent ({PREP_ORG}-prep_of))) Method  Our Approach  Pattern Sequences Extraction  Pattern Model 1) Searching the sentences containing all  Ex) (3+1)/(0+1) = 4  Lexical Sequence Pattern arguments of each tuple in source documents 2) Segmenting out subpart of the sentence kidnapped  + Term Weight (from Dependency Analysis) based on clausal boundaries nsubjpass agent <HUM_TGT> of [NP] have been kidnapped by <PRED_IND> 3) Replacing the parts of arguments in the (1+1)/(1+1) = 1 (2+1)/(1+1) = 1.5 1 0.33 0.33 4 4 4 1.5 1.5 sub-sentence with argument labels  Computing Term Weights <HUM_TGT> <PREP_IND>  Soft Pattern Matching  Sequence Alignment wi = (ri + c) / (di + c) prep_of prep_of about 50 peasants have been kidnapped by terrorists wi : weight of i-th term ri : number of relevant terms within [NP] <PREP_ORG> a subtree, ti as root <HUM_TGT> of [NP] have been kidnapped by <PREP_IND> di : distance from root node (0+1)/(2+1) = 0.33 (1+1)/(2+1) =0.67 c : for smoothing (default:1) Experiment  Pattern Matching  Sequence Alignment  Experimental Setup  Experimental Result  Based on a Dynamic Programming  Data  Pattern Models  Alignment Matrix  MUC-3/4 Data  SVO Model (Yangarber ‘00) peasants have been kidnapped by terrorists  About the Terrorism Events  Linked-Chain Model (Greenwood ‘06) <HUM_TGT> 1 0 0 0 0 0 of 0 1 0 0 0 0  Simpler template structure with 4 slots  Subtree Model (Sudo ‘03) [NP] 0 0.66 1 0 0 0  perp_ind, perp_org, phys_tgt, hum_tgt  Our Model have 0 4 3 2 1 0 been 0 0 8 7 6 5  Dev-set (training), Test-set (evaluation)  Result kidnapped 0 0 4 12 11 10  Preprocessing Model Precision Recall F-measure by 0 0 0 8 13.5 12.5 <PRED_IND> 1.5 0.5 0 0 0 15  Dependency Parsing and NP-chunking SVO 21.74 20.62 21.16  Stanford Parser Linked-Chain 20.04 26.55 22.84  Matrix Computation  Extracting Pattern Candidates Subtree 23.34 32.73 27.25 Alignment 23.35 45.62 30.89 Mi-1,j-1 + sim i-1,j-1 * wi-1  Selecting all pattern candidates for test Mi-1,j + gp * wi-1  Without pattern filtering  Our proposed model achieved much Mi,j = max Mi,j-1 + gp * wi  To compare not the pattern filtering higher recall than the other models with 0 method, but the representative performance similar precision among pattern models