  1. A CROSS-LINGUAL ANNOTATION PROJECTION-BASED SELF-SUPERVISION APPROACH FOR OPEN INFORMATION EXTRACTION
     The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011)
     November 10th, 2011, Chiang Mai
     Seokhwan Kim (POSTECH), Minwoo Jeong (Microsoft Bing), Jonghoon Lee (POSTECH), Gary Geunbae Lee (POSTECH)
  2. Contents
     • Introduction
     • Open Information Extraction
     • Cross-Lingual Annotation Projection
     • Implementation
     • Evaluation
     • Conclusions
  4. Information Extraction
     • Goal: to generate structured information from natural language documents
       • Representing semantic relationships among a set of arguments
     • Example: "Barack Obama was born on August 4, 1961, in Honolulu, Hawaii."
       Person: Barack Obama | Birthday: August 4, 1961 | Birthplace: Honolulu
  5. Previous Approaches
     • Many supervised machine learning approaches have been successfully applied to the RDC task
       (Kambhatla, 2004; Zhou et al., 2005; Zelenko et al., 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhang et al., 2006)
       • Large amounts of training data are required
     • Weakly-supervised techniques have been sought (Zhang, 2004; Chen et al., 2006; Zhou et al., 2009)
       • To learn the IE system without significant annotation effort
     • Open Information Extraction (Banko et al., 2007; Wu and Weld, 2010)
  7. Open Information Extraction
     • An alternative weakly-supervised IE paradigm (Banko et al., 2007)
     • Problem Definition: d → {⟨ei, ri,j, ej⟩}, 1 ≤ i, j ≤ n
       • Binary relation extraction between the entities ei and ej
       • Considering relationships explicitly represented by ri,j in the text
     • Goal: large-scale IE
       • Domain-independent and relation-independent
       • Without hand-crafted rules or hand-annotated training examples
  8. How to Eliminate Human Supervision
     • Self-supervised learning for Open IE: using training examples obtained automatically from external knowledge
     • Previous systems
       • TextRunner (Banko et al., 2007): Penn Treebank, plus a small set of heuristics about syntactic structural constraints
       • WoE (Wu and Weld, 2010): Wikipedia articles and Wikipedia infoboxes
  9. What's the Problem?
     • Previous approaches mainly depend on language-specific knowledge for English
       • Heuristic-based approach: requires a syntactic treebank and heuristics designed for the target language
       • Wikipedia-based approach: Wikipedia articles and infoboxes exist for languages other than English, but the amount of available resources differs widely among languages
         (English Wikipedia: 3,500,000 articles; Korean Wikipedia: 150,000 articles)
  11. Cross-lingual Annotation Projection
      • Goal: to obtain training examples for the target language LT
      • Method: to leverage parallel corpora to project the annotations on the source language LS onto the target language LT
        • The premise is that parallel corpora between LS and LT are much easier to obtain than a task-specific training dataset for LT
      • Example
        ⟨e1, r12, e2⟩ = ⟨Barack Obama, was born in, Honolulu⟩
        EN: Barack Obama was born in Honolulu , Hawaii .
        KO: 버락 오바마 는 하와이 의 호놀룰루 에서 태어났다
            (beo-rak-o-ba-ma) (neun) (ha-wa-i) (ui) (ho-nol-rul-ru) (e-seo) (tae-eo-nat-da)
        ⟨e1, r13, e3⟩ = ⟨beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru⟩
  12. Cross-lingual Annotation Projection
      • Previous work
        • Part-of-speech tagging (Yarowsky and Ngai, 2001)
        • Named-entity tagging (Yarowsky et al., 2001)
        • Verb classification (Merlo et al., 2002)
        • Dependency parsing (Hwa et al., 2005)
        • Mention detection (Zitouni and Florian, 2008)
        • Semantic role labeling (Pado and Lapata, 2009)
      • To the best of our knowledge, no work has reported on the Open IE task
  13. Annotation
      • To obtain annotations for the sentences in LS
      • Procedure
        • A set of entities in the given sentence is identified
        • Each instance is composed of a pair of entities
        • For each instance, extraction is performed
      • Example
        Barack Obama was born in Honolulu , Hawaii .
        ⟨e1, r12, e2⟩ = ⟨Barack Obama, was born in, Honolulu⟩
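The instance-construction step above, pairing every two entities identified in the source sentence, can be sketched as follows. This is a minimal illustration, not the paper's implementation; entity spans are assumed to be supplied by a preprocessor.

```python
from itertools import combinations

def candidate_instances(tokens, entity_spans):
    """Enumerate candidate relation instances as ordered pairs of entities.

    tokens: list of word tokens of the source sentence.
    entity_spans: list of (start, end) token-index spans, one per entity
                  (end exclusive).
    Returns a list of (e1_span, e2_span, between_tokens) candidates, where
    between_tokens is the context between the two entity mentions.
    """
    instances = []
    for (s1, e1), (s2, e2) in combinations(sorted(entity_spans), 2):
        between = tokens[e1:s2]  # tokens between the two entity mentions
        instances.append(((s1, e1), (s2, e2), between))
    return instances

tokens = "Barack Obama was born in Honolulu , Hawaii .".split()
entities = [(0, 2), (5, 6), (7, 8)]  # Barack Obama / Honolulu / Hawaii
for inst in candidate_instances(tokens, entities):
    print(inst)
```

Each candidate is then passed to the English Open IE extractor, which decides whether the in-between context expresses a relation.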
  17. Projection
      • To project the annotations from the sentences in LS onto the sentences in LT using word alignment information
      • Procedure
        • A set of entities in the given sentence is identified
        • Each instance is composed of a pair of entities
        • For each instance, the existence of a relationship is determined
        • If the instance is positive, the contextual subtext is projected
      • Example
        ⟨e1, r12, e2⟩ = ⟨Barack Obama, was born in, Honolulu⟩
        EN: Barack Obama was born in Honolulu , Hawaii .
        KO: 버락 오바마 는 하와이 의 호놀룰루 에서 태어났다
            (beo-rak-o-ba-ma) (neun) (ha-wa-i) (ui) (ho-nol-rul-ru) (e-seo) (tae-eo-nat-da)
        ⟨e1, r13, e3⟩ = ⟨beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru⟩
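The span-projection step, carrying an annotated span across via word alignments, might look like this minimal sketch. The alignment pairs below are illustrative, loosely following the Obama example; in the actual pipeline they come from GIZA++.

```python
def project_span(span, alignments):
    """Project a source-side token span onto the target sentence.

    span: (start, end) token indices (end exclusive) on the source side.
    alignments: set of (src_idx, tgt_idx) word-alignment pairs.
    Returns the smallest target span covering every target token aligned
    to a source token inside the span, or None if nothing is aligned.
    """
    targets = [t for s, t in alignments if span[0] <= s < span[1]]
    if not targets:
        return None
    return (min(targets), max(targets) + 1)

# src: 0 Barack 1 Obama 2 was 3 born 4 in 5 Honolulu
# tgt: 0 beo-rak-o-ba-ma 1 neun 2 ha-wa-i 3 ui
#      4 ho-nol-rul-ru 5 e-seo 6 tae-eo-nat-da
align = {(0, 0), (1, 0), (2, 6), (3, 6), (4, 5), (5, 4)}
print(project_span((0, 2), align))  # entity "Barack Obama"
print(project_span((2, 5), align))  # relation phrase "was born in"
print(project_span((5, 6), align))  # entity "Honolulu"
```

Under these assumed alignments, the relation phrase projects onto "e-seo tae-eo-nat-da" and the entities onto "beo-rak-o-ba-ma" and "ho-nol-rul-ru", matching the slide's Korean triple.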
  23. Overall Architecture
      [Diagram] English-Korean Parallel Corpus → Self-Supervision → Korean Annotated Corpus → Learning → Korean Open IE Model; the model is applied to Korean raw text in the Extraction step to produce the extracted results.
  24. Cross-lingual Annotation Projection-based Self-Supervision
      [Diagram] The parallel corpus is split into English and Korean sentences, each run through its own preprocessors; the English Open IE system annotates the English side (Annotation), word alignment links the two sides, and the English annotated corpus is projected onto the Korean side (Projection) to yield the Korean annotated corpus.
  25. Cross-lingual Annotation Projection-based Self-Supervision
      • Dataset: English-Korean parallel corpus
        • 266,892 bi-sentence pairs in English and Korean
      • Preprocessors
        • English: OpenNLP toolkit
        • Korean: Espresso toolkit
  26. Cross-lingual Annotation Projection-based Self-Supervision
      • English Open IE: our own implementation of Banko's method
        • Dataset: the WSJ part of the Penn Treebank
          • By applying a series of heuristics (Banko, 2009)
          • 1,028,361 instances from 49,208 sentences (9.0% positive)
        • Model: Conditional Random Fields (CRF)
          • With lexical and POS tag features
          • CRF++ toolkit
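The CRF here is trained with the CRF++ toolkit over lexical and POS tag features; a feature template along the following lines might be used. This is purely illustrative, assuming a two-column (word, POS) training format; the authors' actual template is not given in the slides.

```
# Unigram features over the word column (0) and POS column (1),
# in a window of one token around the current position
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U10:%x[-1,1]
U11:%x[0,1]
U12:%x[1,1]
# Combined word/POS feature at the current token
U20:%x[0,0]/%x[0,1]
# Bigram feature over adjacent output labels
B
```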
  27. Cross-lingual Annotation Projection-based Self-Supervision
      • Word alignment: aligned by the GIZA++ toolkit
        • In the standard configuration, in both directions
        • The bi-directional alignments were joined using the grow-diag-final algorithm
      • Chunk-based reorganization
        • To reduce word alignment errors
        • Generates alignments between pairs of base phrase chunks
        • Uses a simple greedy algorithm based on the overlap score of aligned words between base phrase chunks
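The greedy chunk-based reorganization could be sketched as below, scoring each chunk pair by how many word alignments fall inside both spans and taking pairs greedily from the highest score down. This is a simplification under an assumed one-to-one chunk pairing; the paper's exact overlap score may differ.

```python
def chunk_alignments(src_chunks, tgt_chunks, word_align):
    """Greedily align base-phrase chunks by word-alignment overlap.

    src_chunks / tgt_chunks: lists of (start, end) token spans (end exclusive).
    word_align: set of (src_idx, tgt_idx) word alignments.
    Returns one-to-one (src_chunk_idx, tgt_chunk_idx) pairs.
    """
    def overlap(sc, tc):
        # number of word alignments landing inside both chunk spans
        return sum(1 for s, t in word_align
                   if sc[0] <= s < sc[1] and tc[0] <= t < tc[1])

    scored = sorted(
        ((overlap(sc, tc), i, j)
         for i, sc in enumerate(src_chunks)
         for j, tc in enumerate(tgt_chunks)),
        reverse=True)
    used_src, used_tgt, pairs = set(), set(), []
    for score, i, j in scored:
        if score == 0:
            break  # remaining pairs share no aligned words
        if i in used_src or j in used_tgt:
            continue  # keep the pairing one-to-one
        used_src.add(i)
        used_tgt.add(j)
        pairs.append((i, j))
    return sorted(pairs)

# Illustrative chunks over the Obama example and its Korean gloss
src_chunks = [(0, 2), (2, 5), (5, 6)]
tgt_chunks = [(0, 2), (2, 4), (4, 6), (6, 7)]
align = {(0, 0), (1, 0), (2, 6), (3, 6), (4, 5), (5, 4)}
print(chunk_alignments(src_chunks, tgt_chunks, align))
```

Replacing noisy word-level links with chunk-level links in this way is what lets projected entity and relation spans stay contiguous on the Korean side.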
  28. Cross-lingual Annotation Projection-based Self-Supervision
      • Annotated dataset (English): 598,115 instances
        • 169,771 positive instances
      • Projected dataset (Korean): 278,730 instances
        • 89,743 positive instances
  29. Learning & Extraction
      • Extractor for Korean Open IE
        • Maximum Entropy (ME) model
          • To detect whether or not each given instance is positive
          • Features: lexical and POS tag features on the dependency path
          • Maximum Entropy Modeling toolkit
        • Conditional Random Fields (CRF) model
          • To identify the contextual subtext indicating the semantic relationship
          • Features: lexical and POS tag features on the dependency path
          • CRF++ toolkit
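Turning a projected relation span into a CRF training sequence can be illustrated with simple B/I/O tagging over the target-side tokens. This is a hedged sketch of the data format only; the actual feature set also draws on the dependency path, which is omitted here.

```python
def bio_labels(tokens, relation_span):
    """Label tokens with B/I/O tags marking the contextual subtext
    (the relation-indicating phrase) for CRF sequence training.

    relation_span: (start, end) token indices (end exclusive) of the
    projected relation phrase.
    """
    start, end = relation_span
    labels = []
    for i, _ in enumerate(tokens):
        if i == start:
            labels.append("B-REL")   # first token of the relation phrase
        elif start < i < end:
            labels.append("I-REL")   # inside the relation phrase
        else:
            labels.append("O")       # outside
    return labels

# Romanized Korean tokens from the projection example; the projected
# relation phrase "e-seo tae-eo-nat-da" occupies indices 5-6
tokens = ("beo-rak-o-ba-ma neun ha-wa-i ui "
          "ho-nol-rul-ru e-seo tae-eo-nat-da").split()
print(bio_labels(tokens, (5, 7)))
```

The ME model filters instances first; only positive instances reach this CRF labeling stage.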
  31. Evaluation #1
      • Dataset: 250 sentences from Korean Wikipedia articles
        • With a manually annotated gold standard: 1,434 instances, 308 positive
      • Baseline: heuristic-based system
        • Sejong treebank corpus (Korean)
        • The set of heuristics used for the English Open IE system, minus the language-specific rules
  32. Evaluation #1
      • Comparison of performances

        Model                  |  P   |  R   |  F
        -----------------------|------|------|-----
        Heuristic              | 47.7 | 20.1 | 28.3
        Projection             | 33.6 | 49.0 | 39.8
        Heuristic + Projection | 41.9 | 46.4 | 44.1
  36. Evaluation #2
      • Datasets
        • Korean Newswire: 302,276 documents, 2,565,487 sentences
        • Korean Wikipedia: 123,000 articles, 1,342,003 sentences
      • Manual evaluation for four relation types
        • BIRTH_PLACE, WON_AWARD, ACQUISITION, INVENT_OF
  37. Evaluation #2
      • Evaluation results for four relation types

        Type        | Newswire precision | # of extractions | Wikipedia precision | # of extractions
        ------------|--------------------|------------------|---------------------|-----------------
        Birth Place | 65.2               | 256              | 69.1                | 971
        Won Award   | 57.4               | 824              | 63.3                | 286
        Acquisition | 67.0               | 1,112            | 50.3                | 143
        Invent Of   | 53.1               | 32               | 47.6                | 103

      • 3,727 extractions with a precision of 63.7% across the four relation types
  38. Evaluation #2
      • Distribution of the errors

        Error Type               | # of errors
        -------------------------|------------
        Chunking Error           | 364 (26.9%)
        Dependency Parsing Error | 461 (34.1%)
        Extracting Error         | 527 (39.0%)
  40. Conclusions
      • Summary
        • A cross-lingual annotation projection approach for Open IE
        • A Korean Open IE system developed using an English Open IE system and an English-Korean parallel corpus
        • Our system outperformed the heuristic-based system
        • Our system achieved 63.7% precision in a large-scale evaluation
      • Ongoing work
        • Reducing sensitivity to errors committed by the preprocessors
        • Investigating hybrid approaches that consider various external knowledge sources
  41. Q&A