About Me
• Eduardo Gonzalez
• Japan Business Systems
• Japanese System Integrator (SIer)
• Social Systems Design Center (R&D)
• University of Pittsburgh
• Computer Science
• Japanese
@wm_eddie
Agenda
• Intro to Text mining with Spark
• Pre-processing Japanese Text
• Japanese Word Breaking
• Spark Gotchas
• Topic Extraction with LDA
• Intro to Word2Vec
• Recommendation with Word Embedding
Machine Learning Vocabulary
• Feature: A number that represents something
about a data point
• Label: A feature of the data we want to predict
• Document: A block of text with a unique ID
• Model: A learned set of parameters that can
be used for prediction
• Corpus: A collection of documents
Feature, Label, Document, Model, and Corpus are the basic vocabulary you need for machine learning
What is Apache Spark
• A library that defines a Resilient Distributed Dataset
type and a set of transformations
• RDDs are only lazy representations of computations, not materialized data
• A runtime that can execute RDDs in a distributed
manner
• A master process that schedules and monitors executors
• Executors actually do the calculations and can keep results in their
memory
• Spark SQL, MLlib, and GraphX define special types of RDDs
Spark is a general-purpose distributed processing platform with components for SQL, machine learning, and graph processing
Apache Spark Example
import org.apache.spark.{SparkConf, SparkContext}

object Main extends App {
  val sc = new SparkContext(new SparkConf())
  val text = sc.textFile("hdfs:///kjb.txt")
  // Classic word count: split each line, pair every word with 1, sum per word
  val counts = text.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.collect().foreach(println)
}
A WordCount application built with Spark looks like this
Spark’s Text-Mining Tools
• LDA for Topic Extraction
• Word2Vec: an unsupervised way to turn words into features based on their meaning
• CountVectorizer: turns documents into vectors based on word count
• HashingTF-IDF: calculates the important words of a document with respect to the corpus
• And much more
Spark’s text-mining tools include LDA, CountVectorizer, HashingTF-IDF, and more (a brief sketch follows)
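As a rough illustration of the last two bullets, here is a minimal sketch using MLlib's HashingTF and IDF; the input path and whitespace tokenization are assumptions for the example, not from the slides.

import org.apache.spark.mllib.feature.{HashingTF, IDF}

// Hypothetical corpus: one document per line, naively split on spaces
val docs = sc.textFile("hdfs:///docs.txt").map(_.split(" ").toSeq)

val tf = new HashingTF().transform(docs)     // term-frequency vectors via hashing
tf.cache()                                   // IDF makes two passes over the data
val tfidf = new IDF().fit(tf).transform(tf)  // reweight by inverse document frequency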
How to use Spark LDA
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))

// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)
LDA Step 0: Get words
To run LDA, you first need to extract the words
Word Segmentation
• Hard to actually get right.
• Simple in theory with English
• str.split(" ")
• But not enough for real data.
• (Take parens for example.)
• [“(Take”, “parens”, “for”, “example.)”]
• Etc.
Real word extraction is difficult; simply splitting on delimiters is not enough
Word Segmentation
• Since Japanese lacks spaces it’s hard even in
theory
• A probabilistic approach is necessary
• Thankfully there are libraries that can help
Japanese has no word-delimiting characters, so a probabilistic approach is necessary; libraries make this efficient
Morphological Analyzers
• Include POS tagging, pronunciation and
stemming
• MeCab
• Written in C++ with SWIG bindings to pretty much everything
• Kuromoji
• Written in Java, available via Maven
• Others
For morphological analysis (POS tagging, pronunciation, stemming), there are libraries such as MeCab and Kuromoji
JMecab & Spark/Hadoop
• Not impossible but difficult
• Add MeCab to each node
• Add jar to classpaths
• Include jar in project for compilation
• Not too bad on your own hardware, but painful with Amazon EMR or Azure HDInsight
JMecab requires installing MeCab in advance, so it is manageable on-premises but hard to use in cloud environments
Kuromoji & Spark/Hadoop
• Easy
• Include dependency in build.sbt
• Include jar file in FatJar with sbt-assembly
Kuromoji only needs a dependency added and a FatJar built, so it is easy to use (a sketch follows)
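A hedged sketch of that setup; the repository URL and versions below are assumptions, not taken from the slides.

// build.sbt (illustrative coordinates and versions)
resolvers += "Atilika Open Source repository" at
  "http://www.atilika.org/nexus/content/repositories/atilika"
libraryDependencies += "org.atilika.kuromoji" % "kuromoji" % "0.7.7"

// project/plugins.sbt, so `sbt assembly` can build the FatJar
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")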
Using Kuromoji
import org.atilika.kuromoji.Tokenizer

object Main extends App {
  import scala.collection.JavaConverters.asScalaBufferConverter

  val tokenizer = Tokenizer.builder().build()
  val ex1 = "リストのような構造の物から条件を満たす物を探す"
  val res1 = tokenizer.tokenize(ex1).asScala
  for (token <- res1) {
    println(s"${token.getBaseForm}\t${token.getPartOfSpeech}")
  }
}
Vocabulary
// lazy so the (non-serializable) tokenizer is created on each executor
lazy val tokenizer = Tokenizer.builder().build()

val text = sc.textFile("documents")
val words = for {
  line <- text
  token <- tokenizer.tokenize(line).asScala
} yield token.getBaseForm
// Assign every distinct base form a unique index
val vocab = words.distinct().zipWithIndex().collectAsMap()
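The slides jump straight from the vocabulary to the model, so the counting step is sketched here as an assumption: each document becomes a sparse term-count vector indexed by vocab, in the (id, Vector) shape that LDA.run expects.

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val vocabSize = vocab.size
val corpus: RDD[(Long, Vector)] = text.zipWithIndex().map { case (line, id) =>
  val counts = scala.collection.mutable.Map.empty[Int, Double]
  for (token <- tokenizer.tokenize(line).asScala) {
    val word = token.getBaseForm
    if (vocab.contains(word)) {                 // skip out-of-vocabulary words
      val idx = vocab(word).toInt
      counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
    }
  }
  (id, Vectors.sparse(vocabSize, counts.toSeq))
}.cache()

val ldaModel = new LDA().setK(3).run(corpus)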
Using the LDA model
• Prediction requires a LocalLDAModel
• Use .toLocal if
isInstanceOf[DistributedLDAModel]
• Convert to Vector using same steps
• Be sure to filter out words not in the vocabulary
• Call topicDistributions to see topic scores
The LDA model is then used to predict topics
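Roughly, those steps might look like the following; newDocs is an assumed RDD[(Long, Vector)] built with the same vocabulary and counting code as the training corpus.

import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}

val localModel: LocalLDAModel = ldaModel match {
  case m: DistributedLDAModel => m.toLocal   // prediction needs the local model
  case m: LocalLDAModel       => m
}

// Per-document topic mixtures for the new documents
val topicScores = localModel.topicDistributions(newDocs)
topicScores.take(5).foreach { case (id, dist) => println(s"$id -> $dist") }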
Now what?
• Find the minimum logLikelihood in a set
of documents you know are OK
• Report an anomaly whenever a new document has a lower logLikelihood
Compute the minimum log-likelihood over a set of documents known to be OK; when a new document falls below that value, classify it as an anomaly
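A hedged sketch of that thresholding idea; knownGoodDocs and newDoc are assumed names, and scoring one document at a time through a single-element RDD is inefficient but keeps the example short.

// Lower bound on the log-likelihood of a single document under the local model
def docLogLikelihood(doc: (Long, Vector)): Double =
  localModel.logLikelihood(sc.parallelize(Seq(doc)))

// Threshold = minimum log-likelihood over documents known to be OK
val threshold = knownGoodDocs.collect().map(docLogLikelihood).min

// Flag a new document as an anomaly if it scores below that threshold
val isAnomaly = docLogLikelihood(newDoc) < threshold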
Word2Vec
• Creates vectors that represent points in meaning space
• Unsupervised but requires a lot of data to
generate good vectors
• Google’s sample vectors trained on 100
billion words (~X00GB?)
• Vectors with less data can provide
interesting similarities but can’t do so
consistently
Word2Vec turns words into vectors that can be compared quantitatively, so similarity between words can be computed
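For reference, a minimal sketch of training MLlib's Word2Vec on Kuromoji-tokenized text; the path, parameters, and query word are illustrative assumptions.

import org.apache.spark.mllib.feature.Word2Vec

val sentences = sc.textFile("hdfs:///docs.txt")
  .map(line => tokenizer.tokenize(line).asScala.map(_.getSurfaceForm).toSeq)

val model = new Word2Vec()
  .setVectorSize(100)
  .setMinCount(5)
  .fit(sentences)

model.findSynonyms("東京", 5).foreach(println)   // (word, cosine similarity) pairs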
Word2Vec Intuition
• Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig.
Linguistic Regularities in Continuous Space Word
Representations. In Proceedings of NAACL HLT, 2013.
Example of actual word vectorization
Embedding with Vector Concatenation
• Calculate the sum of the word vectors in the description
• Add it to vectors from
Word2VecModel.getVectors with special
keyword (Ex. ITEM_1234)
• Create new Word2VecModel using constructor
• ※Not state of the art but can produce
reasonable recommendations without user
rating data
Embedding by vector concatenation: for each item, sum the vectors of the words it contains
Item Embedding (2/2)
def stringToVector(s: String): Array[Double] = {
  val words = tokenizer.tokenize(s).asScala.map(_.getSurfaceForm).toSeq
  // Out-of-vocabulary words fall back to the vector for "は"
  val vectors = words.map(word =>
    Try(model.transform(word)).getOrElse(model.transform("は")))
  val breezeVectors: Seq[DenseVector[Double]] =
    vectors.map(v => new DenseVector(v.toArray))
  // Element-wise sum of the word vectors
  breezeVectors
    .foldLeft(DenseVector.zeros[Double](vectorLength))((a, b) => a :+ b)
    .toArray
}

val embedVectors: Map[String, Array[Float]] = embeds.map {
  case (key, value) => (key, stringToVector(value).map(_.toFloat))
}

val embedModel = new Word2VecModel(embedVectors ++ model.getVectors)
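With the combined model, recommendation becomes a nearest-neighbour query on the item keys; a hedged usage sketch, assuming descriptions were keyed as ITEM_xxxx as above.

// Words and items share one vector space, so keep only item keys
val recommendations = embedModel
  .findSynonyms("ITEM_1234", 20)
  .filter { case (key, _) => key.startsWith("ITEM_") }

recommendations.take(10).foreach { case (item, score) => println(s"$item\t$score") }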