Large-Scale Information Extraction from Textual Definitions through Deep Syntactic and Semantic Analysis

TACL 2015
Claudio Delli Bovi, Luca Telesca
and Roberto Navigli
Presentation: Koji Matsuda (Tohoku University)
Some figures are borrowed from the authors' slides:
http://wwwusers.di.uniroma1.it/~dellibovi/talks/talk_OIE.pdf

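What is this paper about? Extracting knowledge from text
[Figure label: a great Knowledge Base]
• Main claim of the paper:
– Ground the entities and word senses in each sentence (WSD, EL) first, then run Open IE.
– This yields dense, high-quality (disambiguated) knowledge.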
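Example text: "A Mai Tai (MAI-TAI) is a rum-based cocktail. It is sometimes called 'the queen of tropical cocktails'."
Subject    Predicate    Object
Mai Tai    is           cocktail
Mai Tai    base         rum
Mai Tai    is called    "the queen of tropi…."
:          :            :
Each row is a triple <arg1, relation, arg2>.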

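What is this paper about? Building disambiguated knowledge
• Key points
– Acquire knowledge from the graph produced by Entity Linking, WSD, and parsing
• Knowledge tied to entities / word senses is extracted from the parse tree
⇔ in contrast to acquiring information about surface forms (mentions)
– Restrict the input to definitional sentences
• Acquire (high-precision) knowledge from low-noise text
⇔ in contrast to acquiring diverse knowledge from noisy web-scale corpora (ClueWeb, etc.)
• Results
– A fully disambiguated KB
– Open-vocabulary, yet (relatively) dense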

Ground first (EL, WSD), then extract knowledge
[Figure: the definition "A Mai Tai (MAI-TAI) is a rum-based cocktail. It is sometimes called 'the queen of tropical cocktails'." is annotated with entities and word senses. Candidate annotations are marked ✓ or ×, and the predicates appear as the senses "based on" and "called".]
Subject           Predicate          Object
Mai Tai_bn038v    is_bn038v          cocktail_bn038v
Mai Tai_bn038v    based on_bn038v    rum_bn038v
Mai Tai_bn038v    called_bn038v      "the queen of tropi…."
:                 :                  :
→ A disambiguated knowledge base
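
As a rough illustration of why grounding before extraction makes the resulting knowledge denser, the sketch below builds triples keyed by sense IDs instead of surface strings. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Mention` and `Triple` types, the `grounded_triple` helper, and the `id:...` identifiers are all hypothetical and are not real BabelNet IDs.

```python
from typing import NamedTuple, Optional


class Mention(NamedTuple):
    surface: str           # text as it appears in the sentence
    sense: Optional[str]   # ID assigned by EL/WSD, or None if unlinked


class Triple(NamedTuple):
    arg1: str
    relation: str
    arg2: str


def grounded_triple(arg1: Mention, relation: Mention, arg2: Mention) -> Triple:
    """Key each slot by its sense ID, falling back to the surface form."""
    return Triple(*(m.sense or m.surface for m in (arg1, relation, arg2)))


# Hypothetical annotations: two spellings of the same entity share one ID.
a = grounded_triple(Mention("Mai Tai", "id:mai_tai"),
                    Mention("is", "id:be"),
                    Mention("cocktail", "id:cocktail"))
b = grounded_triple(Mention("MAI-TAI", "id:mai_tai"),
                    Mention("is", "id:be"),
                    Mention("cocktail", "id:cocktail"))
c = grounded_triple(Mention("Mai Tai", "id:mai_tai"),
                    Mention("called", "id:call"),
                    Mention("the queen of tropical cocktails", None))

# Triples that differ only in surface form collapse into one KB entry.
kb = {a, b, c}
print(len(kb))
```

Because `a` and `b` differ only in surface spelling, they become the same entry once grounded, whereas a surface-level Open IE system would store them as two unrelated facts.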

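Restricting the input to definitional sentences
[Figure callouts: "Knowledge is extracted precisely from this part" / "This part is not handled"]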

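Background: recent KB construction
• Open IE and its descendants
– NELL [Carlson+, 2010] / ReVerb [Fader+, 2011] / Ollie [Mausam+, 2012]
• KB extension, especially Distant Supervision / Universal Schema
– [Hoffmann+, 2011] / [Riedel+, 2010]
• Both lines of work have evolved toward
– "extracting diverse relations from huge corpora"
• Problems that have emerged as a result
– Neither arguments nor relations are disambiguated
– The output is too sparse to be usable
• The long tail of relations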

DefIE
DefIE: How it works
http://lcl.uniroma1.it/defie
“Atom Heart Mother is the fifth album by English band Pink Floyd.”
1. Extracting relation instances
Dependency Parse
Entity Linking, WSD
Syntactic-Semantic Graph $G_d^{sem}$
Information is extracted from this graph.

Knowledge acquisition from the Syntactic-Semantic Graph
[Figure: Extraction 1, Extraction 2]
Take the shortest path between each pair of entities.
This also yields a lot of unwanted knowledge, so the extractions are scored.
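
The shortest-path step can be sketched as a breadth-first search over an undirected graph. This is only an illustration: the edge list below is hand-written for the example sentence and is not real parser output, and the function names `shortest_path` and `extract` are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_path(graph, source, target):
    """Breadth-first search over an undirected graph {node: set(neighbors)}."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None  # target not reachable from source


def extract(graph, entities):
    """One <arg1, relation, arg2> candidate per connected entity pair."""
    triples = []
    for arg1, arg2 in combinations(entities, 2):
        path = shortest_path(graph, arg1, arg2)
        if path is not None:
            # The nodes strictly between the two entities form the pattern.
            triples.append((arg1, " ".join(path[1:-1]), arg2))
    return triples


# Hypothetical, hand-written edges for the example sentence.
edges = [
    ("is", "Atom Heart Mother"), ("is", "album"), ("album", "fifth"),
    ("album", "by"), ("by", "Pink Floyd"), ("Pink Floyd", "band"),
    ("band", "English"),
]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

extractions = extract(graph, ["Atom Heart Mother", "Pink Floyd"])
print(extractions)
```

On this toy graph the path between the two entities skips the modifier "fifth", which is the kind of shortening a shortest-path pattern gives over the raw word sequence.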

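Scoring with the knowledge base
2. Relation typing and scoring
For each relation $R$, compute the score of $R$ from:
– Total number of extracted instances for $R$ (pattern frequency)
– Length of the relation pattern of $R$ (pattern length)
– Domain and range entropy of $R$ (ambiguity of the pattern's arguments): hypernyms of the domain and range are looked up in BabelNet, and the ambiguity is computed over them
Because the extractions are grounded in the knowledge base, the knowledge base itself can be used to judge how good a relation is.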
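
The slide names three quantities (instance count, pattern length, domain and range entropy), but the formula that combines them did not survive in this text. The sketch below therefore assumes one plausible combination, frequency divided by (entropy + 1) times pattern length, with the domain and range entropies simply added together. Both choices are assumptions made for illustration and should not be read as the paper's definition. The argument types are toy labels standing in for hypernyms that would come from BabelNet.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def relation_score(instances, pattern_length):
    """Score a relation from its extracted instances.

    instances: list of (domain_type, range_type) pairs, one per instance.
    pattern_length: number of tokens in the relation pattern (>= 1).
    """
    frequency = len(instances)
    h = entropy([d for d, _ in instances]) + entropy([r for _, r in instances])
    # ASSUMPTION: reward frequency, penalize ambiguity and long patterns.
    return frequency / ((h + 1.0) * pattern_length)


# Toy data: same frequency and length, different argument-type spread.
focused = relation_score([("album", "band")] * 4, pattern_length=3)
scattered = relation_score(
    [("album", "band"), ("person", "city"),
     ("album", "city"), ("person", "band")],
    pattern_length=3,
)
print(focused, scattered)
```

With equal frequency and pattern length, the relation whose arguments always have the same types scores higher than the one whose argument types are spread out, which matches the intent of using entropy as an ambiguity penalty.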

Score computation example

Relation Taxonomization
3. Relation taxonomization
– Hypernym Generalization
– Substring Generalization

Evaluation
• Input corpus:
– BabelNet definitions: 4.4M sentences
• Mostly Wikipedia first sentences
– WSD, EL: graph-based method [Moro, 2013]
• Compared systems
– NELL [Carlson+, 2010]
– PATTY+Wikipedia [Nakashole+, 2012]
– ReVerb+ClueWeb [Fader+, 2011]
– WiSeNet+Wikipedia [Moro and Navigli, 2013]

Evaluation (Size, Precision)
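The input text corpus is relatively small (4.4M sentences), yet more knowledge is acquired.
[Table callouts, input corpora: definitional sentences only (DefIE); full Wikipedia (PATTY, WiSeNet); ClueWeb09 (ReVerb)]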

Evaluation (Precision, Novelty)
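Precision: sampled knowledge was inspected by hand and judged for correctness.
Novelty: for each competing KB, it was checked by hand whether equivalent knowledge exists there.
About 60% of the knowledge cannot be obtained with ReVerb.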

Evaluation (Coverage)
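A gold standard was built by manual IE on five Wikipedia articles about musicians, and each resource was checked for how much of it is covered.
• Freebase and DBpedia do not use information from the article body
Of 100 items taken from Freebase, about 83 are covered by DefIE.
Even without a web-scale corpus, about 70% can be covered.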

Evaluation (Other)
With non-definitional text as input, precision drops sharply.
Subordinate clauses, coreference, etc.
When existing methods are given only definitional text as input, the number of extractions drops sharply.

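Summary
• Extracts "grounded knowledge" about the definitions of things from text
– EL, WSD, Parsing
• Rather than blindly using a large-scale corpus, knowledge that is not in existing KBs is acquired from definitional sentences alone
[Figure callout: knowledge is extracted from here]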

BabelNet
• Multilingual Encyclopedic Dictionary
– Lexicographic & Encyclopedic knowledge
– Based on Automatic Integration of:
• WordNet, Wikipedia, Wiktionary, …
• 50 Languages, 21M definitions, 62M entries
[Figure labels: Concepts from WordNet; Named Entities and specialized concepts from Wikipedia; Concepts integrated from both resources]

Lexical Knowledge Base
Encyclopedic Knowledge Base
Integrated Knowledge Base
Semantic Signature
Input Text: "Thomas and Mario are strikers playing in Munich. They are …"
Semantic Interpretation Graph
[Figure nodes: Thomas Muller, Thomas Millan, Mario Gomez, striker, playing, Munich, FC Bayern Munich]
→ Select most suitable meaning on the Graph
[Moro+, 2013]
