Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

  1. Japanese Text Mining with Scala and Spark. Eduardo Gonzalez, Scala Matsuri 2016
  2. About Me
     • Eduardo Gonzalez
     • Japan Business Systems
     • Japanese System Integrator (SIer)
     • Social Systems Design Center (R&D)
     • Pittsburgh University
     • Computer Science
     • Japanese
     • @wm_eddie
  3. Agenda
     • Intro to Text mining with Spark
     • Pre-processing Japanese Text
     • Japanese Word Breaking
     • Spark Gotchas
     • Topic Extraction with LDA
     • Intro to Word2Vec
     • Recommendation with Word Embedding
  4. Machine Learning Vocabulary
     • Feature: A number that represents something about a data point
     • Label: A feature of the data we want to predict
     • Document: A block of text with a unique ID
     • Model: A learned set of parameters that can be used for prediction
     • Corpus: A collection of documents
     The vocabulary assumed throughout for machine learning: Feature, Label, Document, Model, and Corpus.
  5. What is Apache Spark
     • A library that defines a Resilient Distributed Dataset (RDD) type and a set of transformations
     • RDDs are only representations of calculations
     • A runtime that can execute RDDs in a distributed manner
     • A master process that schedules and monitors executors
     • Executors actually do the calculations and can keep results in their memory
     • Spark SQL, MLlib and GraphX define special types of RDDs
     Spark is a general-purpose distributed processing platform with components for SQL, machine learning, and graphs.
  6. Apache Spark Example

         import org.apache.spark.{SparkConf, SparkContext}

         object Main extends App {
           val sc = new SparkContext(new SparkConf())
           val text = sc.textFile("hdfs:///kjb.txt")
           // Split each line on spaces, then count the occurrences of each word
           val counts = text.flatMap(line => line.split(" "))
             .map(word => (word, 1))
             .reduceByKey(_ + _)
           counts.collect().foreach(println)
         }

     Building a WordCount application with Spark looks like this.
  7. Spark’s Text-Mining Tools
     • LDA for Topic Extraction
     • Word2Vec, an unsupervised way to turn words into features based on their meaning
     • CountVectorizer turns documents into vectors based on word count
     • HashingTF-IDF calculates important words of a document with respect to the corpus
     • And much more
     Spark's text-mining tools include LDA, CountVectorizer, and HashingTF-IDF.
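
To make the word-count and TF-IDF ideas on this slide concrete, here is a minimal plain-Scala sketch with no Spark dependency. The names (`TfIdfSketch`, `termFrequency`, `inverseDocumentFrequency`, `tfIdf`) are hypothetical, and the smoothed IDF form log((N + 1) / (df + 1)) is an assumption chosen for illustration rather than something stated in the talk.

```scala
object TfIdfSketch {
  type Doc = Seq[String]

  // Raw term frequency: how many times each word occurs in one document.
  def termFrequency(doc: Doc): Map[String, Double] =
    doc.groupBy(identity).map { case (w, ws) => w -> ws.size.toDouble }

  // Smoothed inverse document frequency (assumed form): log((N + 1) / (df + 1)),
  // where N is the number of documents and df the number of documents containing the word.
  def inverseDocumentFrequency(corpus: Seq[Doc]): Map[String, Double] = {
    val n = corpus.size
    val docFreq = corpus.flatMap(_.distinct).groupBy(identity).map { case (w, ws) => w -> ws.size }
    docFreq.map { case (w, df) => w -> math.log((n + 1.0) / (df + 1.0)) }
  }

  // Words that appear in every document get a weight of zero.
  def tfIdf(doc: Doc, idf: Map[String, Double]): Map[String, Double] =
    termFrequency(doc).map { case (w, tf) => w -> tf * idf.getOrElse(w, 0.0) }
}
```

With a two-document corpus, a word present in both documents scores zero, which is the sense in which TF-IDF finds words that are important "with respect to the corpus".
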
  8. How to use Spark LDA

         import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
         import org.apache.spark.mllib.linalg.Vectors

         // Load and parse the data
         val data = sc.textFile("data/mllib/sample_lda_data.txt")
         val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
         // Index documents with unique IDs
         val corpus = parsedData.zipWithIndex.map(_.swap).cache()
         // Cluster the documents into three topics using LDA
         val ldaModel = new LDA().setK(3).run(corpus)
  9. sample_lda_data.txt

         1 2 6 0 2 3 1 1 0 0 3
         1 3 0 1 3 0 0 2 0 0 1
         1 4 1 0 0 4 9 0 1 2 0
         2 1 0 3 0 0 5 0 2 3 9
         3 1 1 9 3 0 2 0 0 1 3
         4 2 0 3 4 5 1 1 1 4 0
         2 1 0 3 0 0 5 0 2 2 9
         1 1 1 9 2 1 2 0 0 1 3
         4 4 0 3 4 2 1 3 0 0 0
         2 8 2 0 3 0 2 0 2 7 2
         1 1 1 9 0 2 2 0 0 3 3
         4 1 0 0 4 5 1 3 0 1 0

     (´Д`) This does not look like text
  10. LDA Step 0: Get words
      Before running LDA, the words must first be extracted.
  11. Word Segmentation
      • Hard to actually get right.
      • Simple in theory with English
      • str.split(" ")
      • But not enough for real data.
      • (Take parens for example.)
      • ["(Take", "parens", "for", "example.)"]
      • Etc.
      Real word extraction is hard; splitting on a delimiter alone does not work well.
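
The parenthesis problem above can be reproduced in a few lines of plain Scala. This is only a sketch: `NaiveTokenizer` and its two methods are hypothetical names, and the regex-based alternative is an illustrative assumption, not the approach used in the talk. It still mishandles cases such as contractions.

```scala
object NaiveTokenizer {
  // Splitting on spaces keeps punctuation glued to the neighbouring word.
  def splitOnSpaces(s: String): Seq[String] = s.split(" ").toSeq

  // Splitting on runs of non-letter, non-digit characters drops the punctuation.
  // \p{L} matches any Unicode letter and \p{N} any Unicode digit.
  def splitOnNonWords(s: String): Seq[String] =
    s.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty).toSeq
}
```

Calling `NaiveTokenizer.splitOnSpaces("(Take parens for example.)")` reproduces the list on the slide, while `splitOnNonWords` returns the four bare words.
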
  12. Word Segmentation
      • Since Japanese lacks spaces it’s hard even in theory
      • A probabilistic approach is necessary
      • Thankfully there are libraries that can help
      Japanese has no word delimiter, so extracting words requires a probabilistic approach; libraries make this efficient.
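
As a rough illustration of what "a probabilistic approach" can mean, the sketch below picks the segmentation with the highest product of unigram word probabilities using dynamic programming. Everything in it is an assumption for illustration: the `UnigramSegmenter` name, the toy dictionary, and its probability values are made up, and real analyzers such as MeCab or Kuromoji use much richer models and dictionaries.

```scala
object UnigramSegmenter {
  // Toy unigram probabilities (hypothetical values, for illustration only).
  val wordProb: Map[String, Double] = Map(
    "東京" -> 0.20, "京都" -> 0.20, "東京都" -> 0.10,
    "東" -> 0.02, "京" -> 0.02, "都" -> 0.05,
    "に" -> 0.30, "住む" -> 0.10
  )

  // Returns the most probable segmentation, or None if the text cannot be
  // covered by dictionary words.
  def segment(text: String, maxWordLen: Int = 4): Option[List[String]] = {
    val n = text.length
    // score(i): best log-probability of any segmentation of the first i characters
    val score = Array.fill(n + 1)(Double.NegativeInfinity)
    // back(i): start index of the last word in that best segmentation
    val back = Array.fill(n + 1)(-1)
    score(0) = 0.0
    for {
      end <- 1 to n
      start <- math.max(0, end - maxWordLen) until end
      p <- wordProb.get(text.substring(start, end))
      if score(start) + math.log(p) > score(end)
    } {
      score(end) = score(start) + math.log(p)
      back(end) = start
    }
    if (n > 0 && back(n) < 0) None
    else {
      // Follow the back-pointers from the end of the text to recover the words.
      var end = n
      var words = List.empty[String]
      while (end > 0) {
        words = text.substring(back(end), end) :: words
        end = back(end)
      }
      Some(words)
    }
  }
}
```

With this toy dictionary, "東京都に住む" can be split several ways ("東京" + "都", "東" + "京都", or "東京都"), and the scores decide between them.
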
  13. Morphological Analyzers
      • Include POS tagging, pronunciation and stemming
      • MeCab
      • Written in C++ with SWIG bindings to pretty much everything
      • Kuromoji
      • Written in Java, available via Maven
      • Others
      Libraries such as MeCab and Kuromoji exist for morphological analysis (POS tagging, pronunciation, stemming).
  14. JMecab & Spark/Hadoop
      • Not impossible but difficult
      • Add MeCab to each node
      • Add jar to classpaths
      • Include jar in project for compilation
      • Not too bad with own hardware but painful with Amazon EMR or Azure HDInsight
      JMecab must be installed beforehand, so it is manageable on-premises but hard to run in cloud environments.
  15. Kuromoji & Spark/Hadoop
      • Easy
      • Include dependency in build.sbt
      • Include jar file in FatJar with sbt-assembly
      Kuromoji is easy to use: just add the dependency and build a fat JAR.
  16. Using Kuromoji

          import org.atilika.kuromoji.Tokenizer

          object Main extends App {
            import scala.collection.JavaConverters.asScalaBufferConverter

            val tokenizer = Tokenizer.builder().build()
            val ex1 = "リストのような構造の物から条件を満たす物を探す"
            val res1 = tokenizer.tokenize(ex1).asScala
            // Print each token's base form and part of speech, separated by a tab
            for (token <- res1) {
              println(s"${token.getBaseForm}\t${token.getPartOfSpeech}")
            }
          }
  17. Using Kuromoji
      Kuromoji recognizes the text like this:

          厚生	名詞,一般,*,*
          年金	名詞,一般,*,*
          基金	名詞,一般,*,*
          脱退	名詞,サ変接続,*,*
          に	助詞,格助詞,一般,*
          伴う	動詞,自立,*,*
          手続き	名詞,サ変接続,*,*
          について	助詞,格助詞,連語,*
          の	助詞,連体化,*,*
          リマ	名詞,固有名詞,地域,一般
          インド	名詞,固有名詞,地域,国
          です	助動詞,*,*,*
          リスト	名詞,一般,*,*
          の	助詞,連体化,*,*
          よう	名詞,非自立,助動詞語幹,*
          だ	助動詞,*,*,*
          構造	名詞,一般,*,*
          の	助詞,連体化,*,*
          物	名詞,非自立,一般,*
          から	助詞,格助詞,一般,*
          条件	名詞,一般,*,*
          を	助詞,格助詞,一般,*
          満たす	動詞,自立,*,*
          物	名詞,非自立,一般,*
          を	助詞,格助詞,一般,*
          探す	動詞,自立,*,*
  18. Step 1: Build Vocabulary
  19. Vocabulary

          lazy val tokenizer = Tokenizer.builder().build()
          val text = sc.textFile("documents")
          // Base form of every token in every line
          val words = for {
            line <- text
            token <- tokenizer.tokenize(line).asScala
          } yield token.getBaseForm
          // Assign each distinct word a unique index
          val vocab = words.distinct().zipWithIndex().collectAsMap()
  20. Step 2: Create Corpus
  21. Corpus

          // Base-form tokens of each document (one document per line)
          val documentWords: RDD[Array[String]] =
            text.map(line => tokenizer.tokenize(line).asScala.map(t => t.getBaseForm).toArray)
          // (word, count) pairs for each document
          val documentCounts: RDD[Array[(String, Int)]] =
            documentWords.map(words => words.distinct.map { word => (word, words.count(_ == word)) })
          // Replace each word with its vocabulary index
          val documentIndexAndCount: RDD[Seq[(Int, Double)]] =
            documentCounts.map(wordsAndCount => wordsAndCount.map { case (word, count) => (vocab(word).toInt, count.toDouble) })
          // Sparse count vectors keyed by a unique document ID
          val corpus: RDD[(Long, Vector)] =
            documentIndexAndCount.map(Vectors.sparse(vocab.size, _)).zipWithIndex.map(_.swap)
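
The same vocabulary-then-counts pipeline can be tried on ordinary Scala collections, which may help when checking the logic without a cluster. This is a sketch under assumptions: `CountVectorSketch`, `buildVocab`, and `toSparseCounts` are hypothetical names, documents are assumed to be already tokenized, and unlike the slide code this version silently drops words that are not in the vocabulary.

```scala
object CountVectorSketch {
  // Assign each distinct word an index, in order of first appearance.
  def buildVocab(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.distinct.zipWithIndex.toMap

  // Turn one tokenized document into sorted (index, count) pairs,
  // the shape expected by a sparse vector constructor.
  def toSparseCounts(doc: Seq[String], vocab: Map[String, Int]): Seq[(Int, Double)] =
    doc
      .collect { case w if vocab.contains(w) => vocab(w) } // skip out-of-vocabulary words
      .groupBy(identity)
      .map { case (idx, hits) => (idx, hits.size.toDouble) }
      .toSeq
      .sortBy(_._1)
}
```

Dropping unknown words matters mostly when the same conversion is reused for documents that were not part of the original corpus.
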
  22. Step 3: Learn Topics
      Training the topic model.
  23. Learn Topics

          val ldaModel = new LDA().setK(10).setMaxIterations(100).run(corpus)
          // `vocabulary` maps a term index back to its word (it is defined on the Filter Stopwords slide)
          val topics = ldaModel.describeTopics(10).map { case (terms, weights) =>
            terms.map(vocabulary(_)).zip(weights)
          }
          // Print the top terms of each topic with their weights, tab-separated
          topics.zipWithIndex.foreach { case (topic, i) =>
            println(s"TOPIC $i")
            topic.foreach { case (term, weight) => println(s"$term\t$weight") }
            println(s"==========")
          }
  24. Step 4: Evaluate
      Evaluating the results.
  25. Topics?

          Topic 0:
          です	0.10870545899718176
          。	0.09623411796419644
          さん	0.06105040403724023
          Topic 1:
          の	0.11035671185240525
          を	0.07860862808644907
          する	0.05605566895190625
          Topic 2:
          お願い	0.07579177409154919
          ご	0.04431117457391179
          よろしく	0.032788330612439916

      The results turned out to be particles and auxiliary words.
  26. Step 5: GOTO 2
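  27. Filter Stopwords

          // Treat the 50 most frequent words as stopwords
          val popular = words
            .map(w => (w, 1))
            .reduceByKey(_ + _)
            .sortBy(-_._2)
            .take(50)
            .map(_._1)
            .toSet
          // Rebuild the vocabulary without the stopwords
          val vocabIndicies = words.distinct().filter(w => !popular.contains(w)).zipWithIndex()
          // collectAsMap() returns a scala.collection.Map, so the annotation uses that type
          val vocab: scala.collection.Map[String, Long] = vocabIndicies.collectAsMap()
          // Index-to-word lookup used when printing topics
          val vocabulary = vocabIndicies.collect().map(_._1)

      Removing stopwords.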
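
The "most frequent words are stopwords" heuristic is easy to experiment with on a local collection before running it on an RDD. The sketch below assumes already-tokenized input; `StopwordSketch`, `mostFrequent`, and `removeStopwords` are hypothetical names, and ties are broken alphabetically only to make the result deterministic.

```scala
object StopwordSketch {
  // The n most frequent words, ties broken by the word itself.
  def mostFrequent(words: Seq[String], n: Int): Set[String] =
    words
      .groupBy(identity)
      .map { case (w, ws) => (w, ws.size) }
      .toSeq
      .sortBy { case (w, count) => (-count, w) }
      .take(n)
      .map(_._1)
      .toSet

  // Drop every stopword from a tokenized document.
  def removeStopwords(doc: Seq[String], stopwords: Set[String]): Seq[String] =
    doc.filterNot(stopwords)
}
```

The cutoff (50 on the slide) is a tuning knob: too small leaves particles in the topics, too large starts removing meaningful domain words.
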
  28. Topics!

          Topic 0:
          変更	0.032952997236706624
          サーバー	0.03140777729144046
          設定	0.021643554361727567
          エラー	0.017955380768330902
          Topic 1:
          ログ	0.028665774057609564
          時間	0.026686704628121154
          時	0.02404938565591628
          発生	0.020797622509804107
          Topic 2:
          様	0.0474658820402456
          株式会社	0.026174292703953685
          お世話	0.021939329774535308
  29. Using the LDA model
      • Prediction requires a LocalLDAModel
      • Use .toLocal if isInstanceOf[DistributedLDAModel]
      • Convert to Vector using same steps
      • Be sure to filter out words not in the vocabulary
      • Call topicDistributions to see topic scores
      The LDA model is used to predict topics.
  30. Topics Prediction

          New document topics:
          0.091084004103132,0.1044111561202625,0.09090943947509807,0.11607354553753861,0.10404284803971378,0.09697071269561051,0.09571658794577831,0.0919546186785918,0.09176248930132802,0.11707459810294643
          New document topics:
          0.09424474530277152,0.1183270779577911,0.09230776874419214,0.09835759337114718,0.13159581881630272,0.09279638945611612,0.094124104743527,0.09295449996673977,0.09291472297512052,0.09237727866629193

      Each value is the score for one topic, in order: Topic 0, Topic 1, Topic 2, Topic …
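
One simple way to read a topic distribution like the ones above is to take the index of the largest score. The helper below is a hypothetical illustration (`TopicPick.dominantTopic` is not part of Spark), and it assumes a non-empty distribution.

```scala
object TopicPick {
  // Index of the highest-scoring topic; the first one wins on ties.
  def dominantTopic(distribution: Seq[Double]): Int =
    distribution.zipWithIndex.maxBy(_._1)._2
}
```

Note that both distributions on the slide are close to uniform (every score is near 0.1), so the winning topic is only marginally ahead of the others.
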
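  31. Now what?
      • Find the minimum logLikelihood in a set of documents you know are OK
      • Report anomaly whenever a new document has a lower logLikelihood
      Compute the minimum log-likelihood over a set of documents known to be fine, and classify a new document as an anomaly when its value falls below that minimum.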
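
The thresholding rule on this slide reduces to two small functions once the log-likelihood scores are available. In this sketch the scores are plain numbers passed in by the caller; `AnomalySketch`, `threshold`, and `isAnomaly` are hypothetical names, and the list of known-good scores is assumed to be non-empty.

```scala
object AnomalySketch {
  // The lowest log-likelihood seen among documents known to be OK.
  def threshold(knownGoodScores: Seq[Double]): Double = knownGoodScores.min

  // A new document is flagged when it scores strictly below that threshold.
  def isAnomaly(score: Double, threshold: Double): Boolean = score < threshold
}
```

For example, with known-good scores of -120.0, -95.5 and -101.2 the threshold is -120.0, so a new document scoring -150.0 would be reported and one scoring -100.0 would not.
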
  32. Anomaly Detection

          val newDoc = sc.parallelize(Seq("平素は当社サービスをご利用いただき、誠にありがとうございます。 "))
          def stringToCountVector(strings: RDD[String]) = { . . . }
          val score = lda.logLikelihood(stringToCountVector(newDoc))
          /* -2153492.694125671 */
  33. Word2Vec
      • Creates vectors that represent points in meaning space
      • Unsupervised but requires a lot of data to generate good vectors
      • Google’s sample vectors trained on 100 billion words (~X00GB?)
      • Vectors trained on less data can provide interesting similarities but can’t do so consistently
      Word2Vec represents words quantitatively as vectors, which makes it possible to compute the similarity between words.
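
Similarity between two points in "meaning space" is commonly measured with cosine similarity. The sketch below shows that measure on plain arrays; `CosineSketch` and its methods are hypothetical names, the vectors are assumed to have equal length, and this is not claimed to be the exact score that Spark's `findSynonyms` reports.

```scala
object CosineSketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // 1.0 for vectors pointing the same way, 0.0 for orthogonal vectors,
  // and 0.0 by convention when either vector is all zeros.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val denominator = norm(a) * norm(b)
    if (denominator == 0.0) 0.0 else dot(a, b) / denominator
  }
}
```

Because the measure ignores vector length, a word vector and a scaled copy of it count as identical in direction.
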
  34. Word2Vec Intuition
      • Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
      An actual example of word vectorization.
  35. Vector Concatenation
      ITEM_01 営業 活用 営業 の 情報 共有 と サポート. . .
  36. Step 1: Make vectors
      Generating the word vectors.
  37. Making Word2VecModel

          // Surface-form tokens of each document
          val documentWords: RDD[Seq[String]] =
            text.map(line => tokenizer.tokenize(line).asScala.map(_.getSurfaceForm).toSeq)
          documentWords.cache()
          // Train 300-dimensional word vectors
          val model = new Word2Vec().setVectorSize(300).fit(documentWords)
  38. Step 2: Use vectors
      Applying the word vectors.
  39. Using Word2VecModel

          model.findSynonyms("日本", 5).foreach(println)
          /*
          (マイクロソフト,3.750299190465294)
          (ビジネス,3.7329870992662104)
          (株式会社,3.323983664186244)
          (システムズ,3.1331352923187987)
          (ビジネスプロダクティビティ,2.595931613590554)
          */

      An actual word-similarity result. The results vary greatly with the source data, so a big dataset is very important.
  40. Recommendation
      • Paragraph Vectors
      • Not available in Spark T_T
      Recommendation based on paragraph vectors cannot be done in Spark.
  41. Embedding with Vector Concatenation
      • Calculate sum of words in description
      • Add it to vectors from Word2VecModel.getVectors with special keyword (Ex. ITEM_1234)
      • Create new Word2VecModel using constructor
      • ※Not state of the art but can produce reasonable recommendations without user rating data
      Embedding by vector concatenation: for each item, sum the vectors of the words it contains.
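
The first two bullets can be sketched with ordinary maps and arrays, leaving Spark out entirely. The names here (`ItemEmbeddingSketch`, `sumVectors`, `embedItem`) are hypothetical, and one design choice differs from the talk's code: words without a vector are skipped here, whereas the slides substitute the vector of a common word.

```scala
object ItemEmbeddingSketch {
  type Vec = Array[Float]

  // Element-wise sum of a list of vectors, starting from the zero vector.
  def sumVectors(vectors: Seq[Vec], dim: Int): Vec =
    vectors.foldLeft(Array.fill(dim)(0f)) { (acc, v) =>
      acc.zip(v).map { case (a, b) => a + b }
    }

  // Add the summed vector of an item's words to the word-vector table
  // under a special key such as "ITEM_1234".
  def embedItem(key: String, words: Seq[String], wordVectors: Map[String, Vec], dim: Int): Map[String, Vec] = {
    val known = words.collect { case w if wordVectors.contains(w) => wordVectors(w) }
    wordVectors + (key -> sumVectors(known, dim))
  }
}
```

Once items live in the same table as words, any nearest-neighbour lookup over that table returns items and words alike, which is what makes the recommendation step possible.
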
  42. Item Embedding (1/2)

          val embeds = Map(
            "ITEM_001_01" -> "営業部門の情報共有と活用をサポートし",
            "ITEM_001_02" -> "組織的な営業力・売れる仕組みを構築します",
            "ITEM_001_03" -> "営業情報のコミュニケーション基盤を構築する",
            "ITEM_002_01" -> "一般的なサーバ、ネットワーク機器やOSレベルの監視に加え",
            "ITEM_002_02" -> "またモニタリングポータルでは、アラームの発生状況",
            "ITEM_002_03" -> "監視システムにより取得されたパフォーマンス情報が逐次ダッシュボード形式",
            "ITEM_003_01" -> "IPネットワークインフラストラクチャを構築します",
            "ITEM_003_02" -> "導入にとどまらず、アプリケーションやOAシステムとの融合を図ったユニファイドコミュニケーション環境を構築",
            "ITEM_003_03" -> "企業内および企業外へのコンテンツの効果的な配信環境、閲覧環境をご提供します"
          )
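  43. Item Embedding (2/2)

          import scala.util.Try
          import breeze.linalg.DenseVector

          // vectorLength is not shown on the slide; it is assumed to equal the Word2Vec vector size
          def stringToVector(s: String): Array[Double] = {
            val words = tokenizer.tokenize(s).asScala.map(_.getSurfaceForm).toSeq
            // Fall back to the vector for "は" when a word is not in the model
            val vectors = words.map(word => Try(model.transform(word)).getOrElse(model.transform("は")))
            val breezeVectors: Seq[DenseVector[Double]] = vectors.map(v => new DenseVector(v.toArray))
            // Sum all word vectors element-wise
            val concat = breezeVectors.foldLeft(DenseVector.zeros[Double](vectorLength))((a, b) => a :+ b)
            concat.toArray
          }
          val embedVectors: Map[String, Array[Float]] = embeds.map {
            case (key, value) => (key, stringToVector(value).map(_.toFloat))
          }
          // New model containing both the item vectors and the original word vectors
          val embedModel = new Word2VecModel(embedVectors ++ model.getVectors)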
  44. Recommending Similar

          embedModel.findSynonyms("ITEM_001_01", 5).foreach(println)
          /*
          (ITEM_001_03,12.577457221575695)
          (ITEM_003_03,12.542920930725996)
          (ITEM_003_02,12.315240961298104)
          (ITEM_001_02,12.260734177166485)
          (ITEM_002_01,10.866897938028856)
          */

      Computing similarity.
  45. Recommending New

          val newSentence = stringToVector("会計・受発注及び生産管理を中心としたシステム")
          embedModel.findSynonyms(Vectors.dense(newSentence), 5).foreach(println)
          /*
          (ITEM_001_02,14.372981084681571)
          (ITEM_003_03,14.343473534848325)
          (ITEM_001_01,13.83593570884867)
          (ITEM_002_01,13.61507040314043)
          (ITEM_002_03,13.462141195072414)
          */

      Recommendation from a new sample.
  46. Thank you
      • Questions?
      • Example source code at: https://github.com/wmeddie/spark-text
