Where this fits in a search engine
[Figure: search engine architecture diagram. Recoverable labels: Browser (user interest, user feedback); Text Processing and Modeling (MeCab, NLTK, etc.); logical view; Query Operations (WordNet); Indexing; Crawler; Data Access; inverted index / Index; Searching (query, retrieved docs); Ranking (ranked docs); Documents (Web or DB).]
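Extracting synonyms
• Procedure for extracting synonyms from WordNet (Japanese → English case)
– Get the base form and part of speech with MeCab
– Keep only nouns, adverbs, verbs, and adjectives
– With SQL, follow word → sense → related sense → related word
• select * from word where lemma=? and pos=?, (base form, part of speech)
• select * from sense where wordid=?, (word["wordid"],)
• select * from sense where synset=? and lang=?, (sense["synset"], "en")
• select * from word where wordid=? and pos=?, (sense2["wordid"], part of speech)
• Is this OK?
– To give the conclusion first: used as is, it did not work well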
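The four queries above can be chained roughly as in the following sketch. It is only an illustration under stated assumptions: the table and column names (word, sense, wordid, lemma, pos, synset, lang) are taken from the queries on the slide, while `build_demo_db`, `related_words`, the demo rows, the synset ID, and the "ja" language code are hypothetical stand-ins so that the example runs without the real WordNet database. The MeCab step is left out; the base form and part of speech are passed in directly. Because the sketch mirrors the slide's queries literally, it inherits whatever shortcoming the slide alludes to.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory stand-in for the WordNet tables (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # allows row["wordid"], as written on the slide
    conn.executescript("""
        create table word (wordid integer, lang text, lemma text, pos text);
        create table sense (synset text, wordid integer, lang text);
    """)
    conn.executemany("insert into word values (?, ?, ?, ?)", [
        (1, "ja", "犬", "n"),
        (2, "en", "dog", "n"),
        (3, "en", "canine", "n"),
    ])
    conn.executemany("insert into sense values (?, ?, ?)", [
        ("demo-synset-1", 1, "ja"),
        ("demo-synset-1", 2, "en"),
        ("demo-synset-1", 3, "en"),
    ])
    return conn


def related_words(conn, lemma, pos, target_lang="en"):
    """Follow word -> sense -> related sense -> related word.

    Assumes conn.row_factory is sqlite3.Row so columns can be read by name.
    """
    found = []
    words = conn.execute(
        "select * from word where lemma=? and pos=?", (lemma, pos)).fetchall()
    for word in words:
        senses = conn.execute(
            "select * from sense where wordid=?", (word["wordid"],)).fetchall()
        for sense in senses:
            # Other senses that share the synset, restricted to the target language
            senses2 = conn.execute(
                "select * from sense where synset=? and lang=?",
                (sense["synset"], target_lang)).fetchall()
            for sense2 in senses2:
                rows = conn.execute(
                    "select * from word where wordid=? and pos=?",
                    (sense2["wordid"], pos)).fetchall()
                for row in rows:
                    if row["lemma"] not in found:
                        found.append(row["lemma"])
    return found
```

Calling `related_words(build_demo_db(), "犬", "n")` walks the demo rows from the Japanese word to the English words in the same synset. Against a real database, the language codes and the part-of-speech codes would need to match whatever that database actually stores.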
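Estimation example (BIM)
• Documents: 1 = "a b c b d", 2 = "a b e f b", 3 = "b g c d", 4 = "b d e", 5 = "a b e g", 6 = "b g h"; N = 6
• Per-word statistics, where n is the number of documents that contain the word:
word          a    b    c    d    e    f    g    h
n             3    6    2    3    3    1    3    1
N − n + 0.5   3.5  0.5  4.5  3.5  3.5  5.5  3.5  5.5
n + 0.5       3.5  6.5  2.5  3.5  3.5  1.5  3.5  1.5
• Query: Q = "a c h". Document 1 shares the words a and c with Q:
$$P(R = 1 \mid D_1) \sim \prod_{t \in D_1 \cap Q} \frac{N - n_t + 0.5}{n_t + 0.5} = \frac{3.5}{3.5} \cdot \frac{4.5}{2.5} = 1.8$$
• Ranking: 6 (≈ 3.67) > 1 = 3 (1.8) > 5 = 2 = 4 (1)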
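The calculation in this example can be reproduced with a short sketch. It assumes the score of a document is the product of (N − n + 0.5) / (n + 0.5) over the words it shares with the query, with n counted directly from the six documents. The function names (`doc_freq`, `bim_weight`, `bim_score`, `rank`) and the tie-breaking by document number are illustrative choices, not part of the original material.

```python
from math import prod

DOCS = {
    1: "a b c b d",
    2: "a b e f b",
    3: "b g c d",
    4: "b d e",
    5: "a b e g",
    6: "b g h",
}


def doc_freq(docs):
    """Count, for each word, the number of documents that contain it."""
    df = {}
    for text in docs.values():
        for term in set(text.split()):  # set(): count each word once per document
            df[term] = df.get(term, 0) + 1
    return df


def bim_weight(n_t, n_docs):
    """Per-word weight (N - n + 0.5) / (n + 0.5)."""
    return (n_docs - n_t + 0.5) / (n_t + 0.5)


def bim_score(query, text, df, n_docs):
    """Product of the weights of the words shared by the query and the document."""
    shared = set(query.split()) & set(text.split())
    # prod() of an empty sequence is 1, so a document with no shared word scores 1
    return prod(bim_weight(df[t], n_docs) for t in shared)


def rank(query, docs):
    """Return (doc_id, score) pairs, best first; ties are broken by document number."""
    df = doc_freq(docs)
    n_docs = len(docs)
    scores = {d: bim_score(query, text, df, n_docs) for d, text in docs.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, `rank("a c h", DOCS)` puts document 6 first, because its only shared word h has the largest weight, 5.5 / 1.5. Several documents receive identical scores, so the order among them depends only on the tie-breaking rule.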