8. Analysis flow
Framework
The Canopy Framework: four main components
1. Topic extraction
   Apply LDA to the corpus to extract topics.
2. Word-sense disambiguation (WSD)
   The WSD determines a set $C^\theta$ of DBpedia concepts, where each $C \in C^\theta$ represents the identified sense of one of the top-k words of a topic.
3. Graph extraction
   Obtain a good candidate set by extracting a topic graph $G$ from DBpedia consisting of the close neighbours of the concepts $C_i$ and the links between them. We investigate how to define the relation $r(C^\theta, C^*)$.
4. Labelling the extracted graph
   We adopt principles from social network analysis to identify in $G$ the most prominent concepts for labelling a topic $\theta$.
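The graph-extraction step can be illustrated with a small sketch. It assumes that "close neighbours" means every concept within a fixed number of hops of a seed concept, with links followed in both directions; the slide does not state the hop limit, so `max_hops` is a hypothetical parameter. The `LINKS` dictionary is an invented toy structure standing in for DBpedia, not real data.

```python
from collections import deque

# Hypothetical toy link structure standing in for DBpedia (invented for illustration).
LINKS = {
    "Atom": ["Physics", "Matter"],
    "Energy": ["Physics"],
    "Physics": ["Science"],
    "Matter": ["Chemistry"],
    "Chemistry": ["Science"],
    "Science": ["Knowledge"],
    "Knowledge": [],
}


def undirected(links):
    """Return a symmetric adjacency map so links can be followed both ways."""
    adj = {node: set() for node in links}
    for src, targets in links.items():
        for dst in targets:
            adj.setdefault(dst, set()).add(src)
            adj[src].add(dst)
    return adj


def extract_topic_graph(links, seeds, max_hops=2):
    """Collect every concept within max_hops of a seed, plus the links among them."""
    adj = undirected(links)
    kept = set(seeds)
    for seed in seeds:
        # Breadth-first search from this seed, stopping at max_hops.
        dist = {seed: 0}
        queue = deque([seed])
        while queue:
            node = queue.popleft()
            if dist[node] == max_hops:
                continue
            for nxt in adj.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        kept.update(dist)
    # Keep only links whose two endpoints both survived.
    return {n: sorted(adj.get(n, set()) & kept) for n in sorted(kept)}


graph = extract_topic_graph(LINKS, ["Atom", "Energy"], max_hops=2)
```

With the seeds "Atom" and "Energy", the toy concept "Knowledge" lies three hops from both and is therefore left out of the extracted graph.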
Unsupervised Graph-based Topic Labelling using DBpedia
June 30, 2013
12. Analysis flow
Formulation
Let $C^\theta$ be a set of $n$ DBpedia concepts $C_i$, $i = 1, \dots, n$, that correspond to a subset of the top-k words representing one topic $\theta$.
The problem is to identify the concept $C^*$ from all available concepts in DBpedia such that the relation $r(C^\theta, C^*)$, measured here by centrality, is strongest.
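Read as an optimisation problem, the formulation can be written compactly. This is a sketch of one reading, assuming that a larger value of $r$ means a stronger relation and that the candidates are restricted to the concepts of the extracted topic graph $G$; neither assumption is stated on this slide.

```latex
C^{*} = \operatorname*{arg\,max}_{C \in G} \; r\left(C^{\theta}, C\right),
\qquad C^{\theta} = \{C_1, \dots, C_n\}
```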
13. Building the graph from DBpedia
Sense Graph Connectivity within a Topic Graph
Outline
1. Abstract
   Motivation
   Main results
2. Analysis flow
   Framework
   Worked example
   Formulation
3. Building the graph from DBpedia
   Sense Graph Connectivity within a Topic Graph
   Labelling
4. Experiments
   Data
   Evaluation method
   Results
16. Building the graph from DBpedia
Labelling
Centrality
1. Common measures, which consider shortest paths only
   Closeness centrality
   Betweenness centrality
2. Measures that consider all possible connections in the network, not just shortest paths
   Information centrality
   Random walk betweenness centrality
3. Measures adopted by the authors
   Focused Closeness Centrality (fCC)
   Focused Information Centrality (fIC)
   Focused Betweenness Centrality (fBC)
   Focused Random Walk Betweenness Centrality (fRWB)
The above measures fCC, fIC, fBC and fRWB are the ones that we experimented with for defining the target function $r$, which quantifies the strength of the relation between each candidate concept and all other concepts in the topic graph $G$.
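To make the idea of a "focused" measure concrete, here is a minimal sketch of a closeness score computed only against the seed concepts. It assumes that "focused" means distances are taken to the seed concepts rather than to every node; the exact definitions of fCC, fIC, fBC and fRWB are not given on this slide, so this is not the authors' formula. `TOPIC_GRAPH`, `closeness` and `best_label` are hypothetical names, and the graph is invented toy data.

```python
from collections import deque

# Invented toy topic graph (symmetric adjacency), not real DBpedia data.
TOPIC_GRAPH = {
    "Atom": {"Physics", "Matter"},
    "Energy": {"Physics"},
    "Physics": {"Atom", "Energy", "Science"},
    "Matter": {"Atom", "Chemistry"},
    "Chemistry": {"Matter", "Science"},
    "Science": {"Physics", "Chemistry"},
}


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def closeness(adj, node, targets=None):
    """Number of targets divided by the total distance from node to them.

    With targets=None every other node counts (plain closeness); passing the
    seed concepts gives the focused variant assumed in this sketch.
    """
    dist = bfs_distances(adj, node)
    pool = targets if targets is not None else adj
    others = [t for t in pool if t != node]
    if not others or any(t not in dist for t in others):
        return 0.0  # a target is unreachable; treated as 0 in this sketch
    return len(others) / sum(dist[t] for t in others)


def best_label(adj, seeds):
    """Pick the candidate with the highest focused closeness to the seeds."""
    return max(sorted(adj), key=lambda c: closeness(adj, c, seeds))


label = best_label(TOPIC_GRAPH, ["Atom", "Energy"])
```

In this toy graph "Physics" is one hop from both seeds, so it receives the highest focused score and is selected as the label.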
18. Experiments
Data
British Academic Written English Corpus
BBC corpus
StackExchange dataset
Note: the data were reduced by applying stop URLs.
20. Experiments
Evaluation method
Test users assigned each label one of "Good Fit", "Too Broad", "Related but not a good label", or "Unrelated". For evaluation, the judgements were grouped into the following two classes:
1. Good Fit
   Good Fit
2. Good-Fit-or-Broader
   Good Fit
   Too Broad
\[
\mathrm{Precision}(k) = \frac{\text{Hits with rank} \le k}{k}
\]
\[
\mathrm{Coverage}(k) = \frac{\text{topics with at least one Hit at rank} \le k}{\text{topics}}
\]
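The two measures and the two evaluation classes can be sketched in a few lines. The sketch assumes that Precision(k) is computed per topic from that topic's ranked label list, and that a Hit is any label whose judgement belongs to the chosen class; the slide does not say how results are aggregated across topics. All names and the `TOPICS` judgements are invented for illustration.

```python
# Which annotator judgements count as a Hit under each evaluation class.
HIT_CLASSES = {
    "Good Fit": {"Good Fit"},
    "Good-Fit-or-Broader": {"Good Fit", "Too Broad"},
}


def hits_for(judgements, evaluation_class):
    """Labels whose judgement counts as a Hit under the given class."""
    accepted = HIT_CLASSES[evaluation_class]
    return {label for label, verdict in judgements.items() if verdict in accepted}


def precision_at_k(ranked, hits, k):
    """Hits with rank <= k, divided by k, for a single topic."""
    return sum(1 for label in ranked[:k] if label in hits) / k


def coverage_at_k(topics, evaluation_class, k):
    """Share of topics with at least one Hit at rank <= k."""
    covered = 0
    for ranked, judgements in topics:
        hits = hits_for(judgements, evaluation_class)
        if any(label in hits for label in ranked[:k]):
            covered += 1
    return covered / len(topics)


# Invented toy data: (ranked labels, judgement per label) for two topics.
TOPICS = [
    (["Physics", "Science", "Sport"],
     {"Physics": "Good Fit", "Science": "Too Broad", "Sport": "Unrelated"}),
    (["Knowledge", "Cooking", "Food"],
     {"Knowledge": "Too Broad",
      "Cooking": "Related but not a good label",
      "Food": "Good Fit"}),
]

ranked, judgements = TOPICS[0]
p_strict = precision_at_k(ranked, hits_for(judgements, "Good Fit"), 2)
p_broad = precision_at_k(ranked, hits_for(judgements, "Good-Fit-or-Broader"), 2)
c_strict = coverage_at_k(TOPICS, "Good Fit", 1)
c_broad = coverage_at_k(TOPICS, "Good-Fit-or-Broader", 1)
```

On the toy data, switching from the "Good Fit" class to "Good-Fit-or-Broader" raises both scores, because "Too Broad" labels start to count as Hits.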