
Using Wikipedia for Natural Language Processing


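Slides from a talk given at the "Symposium on the Use of Web Text as a Corpus," hosted by the National Institute for Japanese Language and Linguistics (NINJAL).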
http://pj.ninjal.ac.jp/corpus_center/lrw2018-symposium.html

Published in: Engineering


  1. 1. STUDIO OUSIA
  2. 2. STUDIO OUSIA
  3. 3. STUDIO OUSIA
  4. 4. STUDIO OUSIA
  5. 5.
  6. 6. STUDIO OUSIA
  7. 7. STUDIO OUSIA
     wiki_corpus.py:

         import bz2
         import sys

         from rdflib import Graph


         def read_ttl(f):
             lines = []
             for line in f:
                 lines.append(line.decode('utf-8').rstrip())
                 if len(lines) == 1000:  # process 1000 lines at a time
                     for triple in parse_lines(lines):
                         yield triple
                     lines = []
             if lines:
                 for triple in parse_lines(lines):
                     yield triple


         def parse_lines(lines):
             g = Graph()
             g.parse(data=u'\n'.join(lines), format='n3')
             return g


         with bz2.BZ2File(sys.argv[1]) as in_file:
             for (_, p, o) in read_ttl(in_file):
                 # Keep only the triples whose object is the article text
                 if p.toPython() == 'http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#isString':
                     print(o.toPython())

     Usage:

         % wget http://downloads.dbpedia.org/2016-10/core-i18n/ja/nif_context_ja.ttl.bz2
         % python wiki_corpus.py nif_context_ja.ttl.bz2 > corpus.txt
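
The batching in `read_ttl` (parsing 1000 lines at a time instead of loading the whole dump into one graph) can be factored into a small reusable helper. The following is only a sketch: `chunked` is a hypothetical helper name that does not come from the slides or from rdflib, and, like the original script, it assumes that every line holds one complete triple so that a batch can be parsed on its own.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Usage sketch: 2,500 dummy "lines" split into batches of 1,000.
batch_sizes = [len(batch) for batch in chunked(range(2500), 1000)]
print(batch_sizes)
```

With such a helper, the body of `read_ttl` would reduce to one loop over `chunked(f, 1000)` that decodes each batch and passes it to `parse_lines`.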
  8. 8. STUDIO OUSIA
     word2vec.py:

         import logging
         import sys

         from gensim.models.word2vec import Word2Vec, LineSentence

         logging.basicConfig(level=logging.INFO)

         # sg=1 selects the skip-gram model
         model = Word2Vec(LineSentence(sys.argv[1]), sg=1)
         model.save(sys.argv[2])

     Usage:

         % mecab -Owakati corpus.txt -o corpus_wakati.txt
         % python word2vec.py corpus_wakati.txt wiki_w2v

         >>> model = Word2Vec.load('wiki_w2v')
         >>> model.wv.most_similar('日本')[:3]
         [('韓国', 0.6719746589660645), ('台湾', 0.6447558403015137), ('英国', 0.6377681493759155)]
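
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough illustration of how that ranking works, the sketch below reimplements it in plain Python. The three-dimensional vectors are made up for the example and are not taken from the trained model, and the function names are hypothetical rather than gensim's.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=3):
    """Rank every other word by cosine similarity to `query`."""
    q = vectors[query]
    scored = [(word, cosine(q, vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (invented values, for illustration only).
toy_vectors = {
    "日本": [0.9, 0.1, 0.2],
    "韓国": [0.8, 0.2, 0.3],
    "台湾": [0.7, 0.3, 0.1],
    "りんご": [0.1, 0.9, 0.4],
}

ranked = most_similar("日本", toy_vectors)
for word, score in ranked:
    print(word, round(score, 3))
```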
  9. 9. STUDIO OUSIA
  10. 10. STUDIO OUSIA
  11. 11. STUDIO OUSIA
  12. 12. STUDIO OUSIA

         % wget http://wikipedia2vec.s3.amazonaws.com/models/ja/2018-04-20/jawiki_20180420_mention.pkl.bz2 -O jawiki_mention.pkl.bz2
         % wget http://wikipedia2vec.s3.amazonaws.com/models/ja/2018-04-20/jawiki_20180420_dic.pkl.bz2 -O jawiki_dic.pkl.bz2
         % bunzip2 jawiki_dic.pkl.bz2 jawiki_mention.pkl.bz2

         % pip install wikipedia2vec
         % wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
         % wikipedia2vec build_dump_db jawiki-latest-pages-articles.xml.bz2 jawiki.db
         % wikipedia2vec build_dictionary jawiki.db jawiki_dic.pkl
         % wikipedia2vec build_mention_db jawiki.db jawiki_dic.pkl jawiki_mention.pkl
  13. 13. STUDIO OUSIA
     自民党 (Liberal Democratic Party)
     Examples of words and phrases with their link probabilities
  14. 14. STUDIO OUSIA
     word_dic.py:

         import sys

         from wikipedia2vec.dictionary import Dictionary
         from wikipedia2vec.mention_db import MentionDB

         dic = Dictionary.load(sys.argv[1])
         db = MentionDB.load(sys.argv[2], dic)

         # Print each phrase once if its link probability is at least 0.2
         words = set()
         for mention in db:
             if mention.link_prob >= 0.2:
                 if mention.text not in words:
                     words.add(mention.text)
                     print(mention.text)

     Usage:

         % python word_dic.py jawiki_dic.pkl jawiki_mention.pkl > out.txt
         % cat out.txt | sort -R | less
         % cat out.txt | wc -l
         1441724
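
The script above keeps phrases whose link probability is at least 0.2. As a rough illustration, and assuming that link probability means the fraction of a phrase's occurrences that are marked up as links (the common definition in entity linking work; the exact counting inside wikipedia2vec may differ), the filter can be sketched as follows. The counts are invented for the example, and `link_probability` is a hypothetical helper name.

```python
def link_probability(link_count, occurrence_count):
    """Fraction of occurrences of a phrase that appear as link anchors."""
    if occurrence_count == 0:
        return 0.0
    return min(1.0, link_count / occurrence_count)


# Hypothetical counts: phrase -> (times used as a link, times it occurs).
toy_counts = {
    "自民党": (800, 1000),
    "東京": (300, 2000),
    "これ": (1, 50000),
}

# Same threshold as word_dic.py.
selected = [text for text, (linked, total) in toy_counts.items()
            if link_probability(linked, total) >= 0.2]
print(selected)
```

A phrase that is almost always linked when it appears is likely a named entity or a term, which is why a threshold on this value yields a usable word list.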
  15. 15. STUDIO OUSIA
     This scientist names a constant that is equal to Loschmidt’s Constant times “RT over P” and is equal to the Faraday constant over the elementary charge.
     Wikipedia: Elementary_charge
     Wikipedia: Faraday_constant
     Wikipedia: Loschmidt_constant
  16. 16. STUDIO OUSIA
     entity_linking.py:

         import sys

         from wikipedia2vec.dictionary import Dictionary
         from wikipedia2vec.mention_db import MentionDB
         from wikipedia2vec.utils.tokenizer.mecab_tokenizer import MeCabTokenizer

         # Arguments: dictionary file, mention DB file, input text file
         dic = Dictionary.load(sys.argv[1])
         db = MentionDB.load(sys.argv[2], dic)
         with open(sys.argv[3]) as f:
             text = f.read()

         tokenizer = MeCabTokenizer()
         tokens = tokenizer.tokenize(text)
         for mention in db.detect_mentions(text, tokens):
             print(mention)

     Usage:

         % python entity_linking.py jawiki_dic.pkl jawiki_mention.pkl input.txt
         <Mention NHK連続テレビ小説 -> 連続テレビ小説>
         <Mention 半分、青い。 -> 半分、青い。>
         <Mention 永野芽郁 -> 永野芽郁>
         <Mention コラムニスト -> コラムニスト>
         <Mention 木村隆志 -> 木村隆志>
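
To give an intuition for dictionary-based mention detection, here is a minimal sketch of greedy longest-match lookup over a token sequence. It is not wikipedia2vec's actual implementation: the function, the toy dictionary, and the hand-written tokenization are all assumptions made for illustration.

```python
def detect_mentions(tokens, mention_dict, max_len=5):
    """Greedy longest-match lookup of token n-grams in a mention dictionary.

    Returns a list of (surface text, entity title) pairs.
    """
    mentions = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            surface = "".join(tokens[i:i + n])
            if surface in mention_dict:
                mentions.append((surface, mention_dict[surface]))
                matched = n
                break
        # Skip past a match, otherwise advance by one token.
        i += matched if matched else 1
    return mentions


# Toy dictionary: surface text -> entity title (illustrative entries only).
toy_dict = {
    "NHK連続テレビ小説": "連続テレビ小説",
    "連続テレビ小説": "連続テレビ小説",
    "永野芽郁": "永野芽郁",
}
# Hand-written tokenization standing in for MeCab output.
toy_tokens = ["NHK", "連続", "テレビ", "小説", "に", "永野", "芽郁", "が", "出演"]

found = detect_mentions(toy_tokens, toy_dict)
for surface, title in found:
    print(surface, "->", title)
```

Preferring the longest match is what makes "NHK連続テレビ小説" win over the shorter "連続テレビ小説" in this toy input.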
  17. 17. STUDIO OUSIA
  18. 18. STUDIO OUSIA Wikipedia2Vec:
     [Figure: the sentence "Aristotle was a philosopher" shown with an embedding space. Labels: Logic, Science, Europe, Socrates, Renaissance, Metaphysics, Philosopher, Philosophy, Avicenna, Aristotle, Plato.]
  19. 19. STUDIO OUSIA Wikipedia2Vec: https://wikipedia2vec.github.io
  20. 20. STUDIO OUSIA Wikipedia2Vec

         % pip install wikipedia2vec
         % wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
         % wikipedia2vec train jawiki-latest-pages-articles.xml.bz2 OUT_FILE
  21. 21. Wikipedia2Vec
  22. 22. STUDIO OUSIA Wikipedia2Vec
  23. 23. Wikipedia2Vec
  24. 24. STUDIO OUSIA Human-Computer Question Answering Match @ NIPS 2017
  25. 25. STUDIO OUSIA
     With the assistance of his chief minister, the Duc de Sully, he lowered taxes on peasantry, promoted economic recovery, and instituted a tax on the Paulette. Victor at Ivry and Arquet, he was excluded from succession by the Treaty of Nemours, but won a great victory at Coutras.
     Henry IV of France
  26. 26. STUDIO OUSIA
  27. 27. STUDIO OUSIA
     [Figure: model architecture. Labels: Words ("The protagonist of …"), Entities (Protagonist, Novel, Author, …), Average, Sum, Dot, Softmax, Answers (Franz Kafka, Tokyo, Calcium).]
     The protagonist of a novel by this author is evicted from the Bridge Inn and is talked into becoming a school janitor…
  28. 28. STUDIO OUSIA
     ‣ p_w, q_e, a_e
     ‣ p_w, q_e, v_D
     ‣ v_D, a_e
     [Figure: the same architecture diagram and example question as on the previous slide.]
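
The figure labels (Words, Entities, Average, Sum, Dot, Softmax, Answers) and the symbols p_w, q_e, a_e, and v_D suggest a model that builds a question vector v_D from word vectors p_w and entity vectors q_e, then scores each candidate answer vector a_e against it. The sketch below is one plausible reading of that diagram, not a confirmed reproduction of the system: it assumes v_D is the average of the word vectors plus the average of the entity vectors, and all numeric values are invented.

```python
import math


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer_distribution(word_vecs, entity_vecs, answer_vecs):
    """Score each candidate answer against the question vector v_D."""
    v_words = average(word_vecs)       # average of p_w
    v_entities = average(entity_vecs)  # average of q_e
    # Assumed composition: v_D = mean(p_w) + mean(q_e)
    v_d = [w + e for w, e in zip(v_words, v_entities)]
    names = list(answer_vecs)
    # Dot product between each a_e and v_D, then softmax over answers.
    scores = [sum(a * b for a, b in zip(answer_vecs[name], v_d))
              for name in names]
    return dict(zip(names, softmax(scores)))


# Toy two-dimensional vectors (invented values).
word_vecs = [[0.2, 0.1], [0.4, 0.3]]
entity_vecs = [[1.0, 0.0], [0.8, 0.2]]
answer_vecs = {
    "Franz Kafka": [1.0, 0.0],
    "Tokyo": [0.0, 1.0],
    "Calcium": [-1.0, 0.0],
}

probs = answer_distribution(word_vecs, entity_vecs, answer_vecs)
best = max(probs, key=probs.get)
print(best)
```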
  29. 29. STUDIO OUSIA
  30. 30. STUDIO OUSIA
     [Figure: answer accuracy of the systems in the AI-versus-AI matches]
     [Figure: scene from the match against quiz experts]
  31. 31. STUDIO OUSIA
 This scientist names a constant that is equal to Loschmidt’s Constant times “RT over P” and is equal to the Faraday constant over the elementary charge. 
 A project named for this scientist is seeking to define the kilogram using a sphere of silicon. 
 His namesake law states that, holding temperature and pressure constant, equal volumes of gases have an equal number of molecules. 
 For 10 points, name this Italian chemist who also names the number of molecules in a mole of a gas, equal to 6.022 times ten to the twenty-third power. 

  32. 32. STUDIO OUSIA www.qaengine.ai
  33. 33. STUDIO OUSIA
