
Natural Language Processing with Wikipedia in Python


Slides for a talk given at the PyData.tokyo One-day Conference 2018.
https://pydatatokyo.connpass.com/event/87511/


  2. STUDIO OUSIA
  7. https://github.com/ikuyamada/wikipedia-nlp
  8. wiki_corpus.py:

      import bz2
      import sys
      from rdflib import Graph

      def read_ttl(f):
          lines = []
          for line in f:
              lines.append(line.decode('utf-8').rstrip())
              if len(lines) == 1000:  # process 1000 lines at a time
                  for triple in parse_lines(lines):
                      yield triple
                  lines = []
          if lines:
              for triple in parse_lines(lines):
                  yield triple

      def parse_lines(lines):
          g = Graph()
          g.parse(data=u'\n'.join(lines), format='n3')
          return g

      with bz2.BZ2File(sys.argv[1]) as in_file:
          for (_, p, o) in read_ttl(in_file):
              if p.toPython() == 'http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#isString':
                  print(o.toPython())

     Usage:

      % wget http://downloads.dbpedia.org/2016-10/core-i18n/ja/nif_context_ja.ttl.bz2
      % python wiki_corpus.py nif_context_ja.ttl.bz2 > corpus.txt
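The read_ttl function above keeps memory use bounded by parsing 1,000 lines at a time, which relies on each line of the dump being a self-contained statement. A minimal sketch of the same batching pattern follows, with the rdflib parsing step left out so that it runs without the dump. The helper name iter_batches and the sample lines are assumptions for illustration and do not come from the slides.

```python
from typing import Iterable, Iterator, List


def iter_batches(lines: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size lines."""
    batch: List[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield batch


# Hypothetical input: 7 lines processed in batches of 3.
sample = ["line %d" % i for i in range(7)]
batch_sizes = [len(b) for b in iter_batches(sample, batch_size=3)]
print(batch_sizes)
```

In the real script each batch would be handed to a parser such as parse_lines instead of being measured.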
  9. word2vec.py:

      import logging
      import sys
      from gensim.models.word2vec import Word2Vec, LineSentence

      logging.basicConfig(level=logging.INFO)
      model = Word2Vec(LineSentence(sys.argv[1]), sg=1)
      model.save(sys.argv[2])

     Usage:

      % mecab -Owakati corpus.txt -o corpus_wakati.txt
      % python word2vec.py corpus_wakati.txt wiki_w2v

      >>> model = Word2Vec.load('wiki_w2v')
      >>> # 日本 = Japan; with gensim 4.x, call model.wv.most_similar instead
      >>> model.most_similar('日本')[:3]
      [('韓国', 0.6719746589660645), ('台湾', 0.6447558403015137), ('英国', 0.6377681493759155)]

     (韓国 = South Korea, 台湾 = Taiwan, 英国 = United Kingdom)
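The most_similar call above ranks words by the cosine similarity of their vectors. As a rough illustration of that ranking, here is a pure-Python sketch on hypothetical three-dimensional vectors. The vectors and the English keys are invented for the example and are not taken from the trained model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors: Dict[str, List[float]], query: str, topn: int = 3) -> List[Tuple[str, float]]:
    """Rank every other word by cosine similarity to the query word."""
    scores = [(w, cosine(vectors[query], vec)) for w, vec in vectors.items() if w != query]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Hypothetical toy embeddings (real models use hundreds of dimensions).
toy_vectors = {
    "japan": [0.9, 0.1, 0.0],
    "korea": [0.8, 0.2, 0.1],
    "taiwan": [0.7, 0.3, 0.0],
    "banana": [0.0, 0.1, 0.9],
}
ranking = most_similar(toy_vectors, "japan", topn=2)
print(ranking)
```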
  11. Wikipedia2Vec

      % pip install wikipedia2vec
  12. % wget http://wikipedia2vec.s3.amazonaws.com/models/ja/2018-04-20/jawiki_20180420.db.bz2 -O jawiki.db.bz2
      % bunzip2 jawiki.db.bz2

      % wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
      % wikipedia2vec build_dump_db jawiki-latest-pages-articles.xml.bz2 jawiki.db
  13. similar_char.py:

      import sys
      import Levenshtein
      from collections import Counter
      from wikipedia2vec.dump_db import DumpDB

      dump_db = DumpDB(sys.argv[1])
      pair_counter = Counter()
      for (title1, title2) in dump_db.redirects():
          # keep redirect pairs whose titles differ by exactly one edit
          ops = Levenshtein.editops(title1.lower(), title2.lower())
          if len(ops) == 1:
              (op, p1, p2) = ops[0]
              if op == 'replace':
                  pair_counter[frozenset((title1[p1], title2[p2]))] += 1

      for (pair, count) in pair_counter.most_common():
          print('%s\t%s\t%d' % (*list(pair), count))

     Usage:

      % python similar_char.py jawiki.db > out.tsv
      % cat out.tsv
      イ	ー	1857
      澤	沢	1747
      ・	=	1124
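The script above counts character pairs that account for a one-character substitution between a redirect title and its target. The same idea can be sketched without the Levenshtein package or a dump database: two equal-length strings that differ at exactly one position are one substitution apart. The redirect pairs below are hypothetical examples, and unlike the script this sketch reports the lowercased characters.

```python
from collections import Counter
from typing import Optional, Tuple


def single_substitution(a: str, b: str) -> Optional[Tuple[str, str]]:
    """Return the differing character pair if a and b differ by exactly one substitution."""
    a, b = a.lower(), b.lower()
    if len(a) != len(b):
        return None  # an insertion or deletion is involved, not a pure substitution
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None


# Hypothetical redirect pairs (title, redirect target).
redirects = [
    ("沢田", "澤田"),
    ("黒澤", "黒沢"),
    ("渡辺", "渡邊"),
    ("Tokyo", "Tokio"),
    ("abc", "abcd"),  # different length: skipped
    ("abc", "xyc"),   # two substitutions: skipped
]

pair_counter = Counter()
for title1, title2 in redirects:
    pair = single_substitution(title1, title2)
    if pair is not None:
        # sort so that (沢, 澤) and (澤, 沢) are counted as the same pair
        pair_counter[tuple(sorted(pair))] += 1

for (c1, c2), count in pair_counter.most_common():
    print("%s\t%s\t%d" % (c1, c2, count))
```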
  18. % wget http://wikipedia2vec.s3.amazonaws.com/models/ja/2018-04-20/jawiki_20180420_mention.pkl.bz2 -O jawiki_mention.pkl.bz2
      % wget http://wikipedia2vec.s3.amazonaws.com/models/ja/2018-04-20/jawiki_20180420_dic.pkl.bz2 -O jawiki_dic.pkl.bz2
      % bunzip2 jawiki_dic.pkl.bz2 jawiki_mention.pkl.bz2

      % wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
      % wikipedia2vec build_dump_db jawiki-latest-pages-articles.xml.bz2 jawiki.db
      % wikipedia2vec build_dictionary jawiki.db jawiki_dic.pkl
      % wikipedia2vec build_mention_db jawiki.db jawiki_dic.pkl jawiki_mention.pkl
  19. 自民党 (Liberal Democratic Party)
      Examples of words and phrases and their link probabilities
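Link probability is commonly defined in entity linking as the share of a phrase's occurrences in which it appears as the anchor text of a link. The sketch below assumes that definition. The counts are hypothetical and the helper link_probability is introduced only for illustration.

```python
from collections import Counter

# Hypothetical counts: how often each phrase appears as link anchor text,
# and how often it appears in article text overall.
link_counts = Counter({"自民党": 80, "日本": 300, "青い": 1})
total_counts = Counter({"自民党": 100, "日本": 6000, "青い": 500})


def link_probability(phrase: str) -> float:
    """Fraction of a phrase's occurrences that are links (0.0 if never seen)."""
    total = total_counts[phrase]
    return link_counts[phrase] / total if total else 0.0


# Keep phrases that are linked in at least 20% of their occurrences.
kept = [p for p in total_counts if link_probability(p) >= 0.2]
print(kept)
```

A phrase such as a party name is almost always linked, while common words are rarely linked, so a threshold on this ratio separates entity-like phrases from ordinary vocabulary.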
  20. word_dic.py:

      import sys
      from wikipedia2vec.dictionary import Dictionary
      from wikipedia2vec.mention_db import MentionDB

      dic = Dictionary.load(sys.argv[1])
      db = MentionDB.load(sys.argv[2], dic)
      words = set()
      for mention in db:
          # keep phrases whose link probability is at least 0.2
          if mention.link_prob >= 0.2:
              if mention.text not in words:
                  words.add(mention.text)
                  print(mention.text)

     Usage:

      % python word_dic.py jawiki_dic.pkl jawiki_mention.pkl > out.txt
      % cat out.txt | sort -R | less
      % cat out.txt | wc -l
      1441724
  21. This scientist names a constant that is equal to Loschmidt’s Constant times “RT over P” and is equal to the Faraday constant over the elementary charge.
      Linked entities: Wikipedia: Loschmidt_constant, Wikipedia: Faraday_constant, Wikipedia: Elementary_charge
  22. entity_linking.py:

      import sys
      from wikipedia2vec.dictionary import Dictionary
      from wikipedia2vec.mention_db import MentionDB
      from wikipedia2vec.utils.tokenizer.mecab_tokenizer import MeCabTokenizer

      # argument order matches the command line below:
      # dictionary, mention DB, input text
      dic = Dictionary.load(sys.argv[1])
      db = MentionDB.load(sys.argv[2], dic)
      with open(sys.argv[3]) as f:
          text = f.read()

      tokenizer = MeCabTokenizer()
      tokens = tokenizer.tokenize(text)
      for mention in db.detect_mentions(text, tokens):
          print(mention)

     Usage:

      % python entity_linking.py jawiki_dic.pkl jawiki_mention.pkl input.txt
      <Mention NHK連続テレビ小説 -> 連続テレビ小説>
      <Mention 半分、青い。 -> 半分、青い。>
      <Mention 永野芽郁 -> 永野芽郁>
      <Mention コラムニスト -> コラムニスト>
      <Mention 木村隆志 -> 木村隆志>
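To give a feel for what dictionary-based mention detection does, here is a sketch that scans a token sequence and greedily looks up the longest token n-gram found in a mention dictionary. This is an assumed, simplified strategy and not necessarily how detect_mentions in wikipedia2vec is implemented. The function, the dictionary contents, and the token list are all hypothetical.

```python
from typing import Dict, List, Tuple


def detect_mentions(tokens: List[str], mention_dict: Dict[str, str], max_len: int = 5) -> List[Tuple[str, str]]:
    """Greedy longest-match lookup of token n-grams in a mention dictionary."""
    mentions = []
    i = 0
    while i < len(tokens):
        # try the longest candidate starting at token i first
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            text = "".join(tokens[i:i + n])  # Japanese tokens are joined without spaces
            if text in mention_dict:
                mentions.append((text, mention_dict[text]))
                i += n
                break
        else:
            i += 1  # no mention starts at this token
    return mentions


# Hypothetical dictionary: mention text -> Wikipedia page title.
mention_dict = {"NHK連続テレビ小説": "連続テレビ小説", "永野芽郁": "永野芽郁"}
tokens = ["NHK", "連続", "テレビ", "小説", "の", "永野", "芽郁"]
found = detect_mentions(tokens, mention_dict)
for text, title in found:
    print("%s -> %s" % (text, title))
```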
  23. Wikipedia2Vec

  24. Wikipedia2Vec
      Figure: example sentence "Aristotle was a philosopher". Labels: Logic, Science, Europe, Socrates, Renaissance, Metaphysics, Philosopher, Philosophy, Avicenna, Aristotle, Plato.
  25. Wikipedia2Vec: https://wikipedia2vec.github.io
  26. Wikipedia2Vec
      Code excerpts:

      cdef inline void _train_pair(int32_t index1, int32_t index2, float32_t alpha,
                                   int32_t negative, int32_t [:] neg_table) nogil:

      f_dot = <float32_t>(blas.sdot(&dim_size, &syn0[index1, 0], &one, &syn1[index, 0], &one))
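The Cython excerpt above computes a dot product between an input vector in syn0 and an output vector in syn1 while training one pair. For orientation, here is a pure-Python sketch of a standard skip-gram update with negative sampling for a single pair. It is not the library's code: the names syn0, syn1, index1, index2 and alpha mirror the excerpt, while passing the negative indices explicitly (instead of sampling them from neg_table) and the toy values are assumptions.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def train_pair(syn0: List[List[float]], syn1: List[List[float]],
               index1: int, index2: int, alpha: float, negatives: List[int]) -> None:
    """One SGD step of skip-gram with negative sampling for the pair (index1, index2)."""
    dim = len(syn0[index1])
    grad_in = [0.0] * dim  # accumulated update for the input vector
    # label 1 for the observed pair, label 0 for each negative sample
    for target, label in [(index2, 1.0)] + [(n, 0.0) for n in negatives]:
        f_dot = dot(syn0[index1], syn1[target])
        g = (label - sigmoid(f_dot)) * alpha
        for d in range(dim):
            grad_in[d] += g * syn1[target][d]       # uses the output vector before its update
            syn1[target][d] += g * syn0[index1][d]  # update the output vector
    for d in range(dim):
        syn0[index1][d] += grad_in[d]


# Toy setup: item 0 is the input, item 1 the observed context, item 2 a negative sample.
syn0 = [[0.1, -0.2, 0.3], [0.0, 0.1, 0.2], [0.3, 0.1, -0.1]]
syn1 = [[0.0, 0.0, 0.0] for _ in range(3)]
for _ in range(100):
    train_pair(syn0, syn1, 0, 1, 0.1, [2])

pos_score = sigmoid(dot(syn0[0], syn1[1]))
neg_score = sigmoid(dot(syn0[0], syn1[2]))
print(pos_score, neg_score)
```

After repeated updates the score of the observed pair rises above 0.5 and the score of the negative pair falls below it, which is the behaviour the training objective asks for.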
  27. Wikipedia2Vec

      % wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
      % wikipedia2vec train jawiki-latest-pages-articles.xml.bz2 OUT_FILE
  28. Wikipedia2Vec
  29. Wikipedia2Vec
  30. Wikipedia2Vec
  31. Human-Computer Question Answering Match @ NIPS 2017
  32. Question: With the assistance of his chief minister, the Duc de Sully, he lowered taxes on peasantry, promoted economic recovery, and instituted a tax on the Paulette. Victor at Ivry and Arques, he was excluded from succession by the Treaty of Nemours, but won a great victory at Coutras.
      Answer: Henry IV of France
  33. Example question: The protagonist of a novel by this author is evicted from the Bridge Inn and is talked into becoming a school janitor…
      Figure labels: Words (The, protagonist, of, …), Entities (Protagonist, Novel, Author, …), Sum, Average, Dot, Softmax, Answers (Franz Kafka, Tokyo, Calcium).
  34. Symbols in the bullets: p_w, q_e, a_e / p_w, q_e, v_D / v_D, a_e
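The figure labels on these slides (Words, Entities, Sum, Average, Dot, Softmax, Answers) suggest a model that averages word and entity embeddings, sums them into one question vector, and scores each answer by a dot product followed by a softmax. The sketch below is one reading of those labels under that assumption, not a confirmed description of the system. The function names and the two-dimensional embeddings are hypothetical, and p_w, q_e, a_e and v_d are named after the slide's symbols.

```python
import math
from typing import Dict, List


def average(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer_probabilities(word_vecs: List[List[float]], entity_vecs: List[List[float]],
                         answer_vecs: Dict[str, List[float]]) -> Dict[str, float]:
    """Score each candidate answer against the question vector v_d."""
    v_d = [w + e for w, e in zip(average(word_vecs), average(entity_vecs))]
    names = list(answer_vecs)
    scores = [sum(a * b for a, b in zip(answer_vecs[name], v_d)) for name in names]
    return dict(zip(names, softmax(scores)))


# Hypothetical 2-dimensional embeddings.
p_w = [[1.0, 0.0], [0.8, 0.2]]  # word embeddings of the question
q_e = [[0.9, 0.1]]              # entity embeddings detected in the question
a_e = {"Franz Kafka": [1.0, 0.0], "Tokyo": [0.0, 1.0], "Calcium": [-1.0, 0.0]}

probs = answer_probabilities(p_w, q_e, a_e)
best = max(probs, key=probs.get)
print(best, probs)
```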
  36. Figure captions: "Answer accuracy of the systems in the AI-vs-AI matches"; "Scene from the match against the quiz experts"