Japanese Text Mining with Scala and Spark
Eduardo Gonzalez
Scala Matsuri 2016
About Me
• Eduardo Gonzalez
• Japan Business Systems
• Japanese System Integrator (SIer)
• Social Systems Design Center (R&D)
• Pittsburgh University
• Computer Science
• Japanese
@wm_eddie
Agenda
• Intro to Text mining with Spark
• Pre-processing Japanese Text
• Japanese Word Breaking
• Spark Gotchas
• Topic Extraction with LDA
• Intro to Word2Vec
• Recommendation with Word Embedding
Machine Learning Vocabulary
• Feature: A number that represents something about a data point
• Label: A feature of the data we want to predict
• Document: A block of text with a unique ID
• Model: A learned set of parameters that can be used for prediction
• Corpus: A collection of documents
The prerequisite machine learning vocabulary is Feature, Label, Document, Model, and Corpus.
What is Apache Spark
• A library that defines a Resilient Distributed Dataset (RDD) type and a set of transformations
• RDDs are only representations of calculations
• A runtime that can execute RDDs in a distributed manner
• A master process that schedules and monitors executors
• Executors actually do the calculations and can keep results in their memory
• Spark SQL, MLlib and GraphX define special types of RDDs
Spark is a general-purpose distributed processing platform with components such as SQL, machine learning, and graph processing.
Apache Spark Example
import org.apache.spark.{SparkConf, SparkContext}

object Main extends App {
  val sc = new SparkContext(new SparkConf())
  // Read the file as an RDD of lines
  val text = sc.textFile("hdfs:///kjb.txt")
  // Split each line into words and count the occurrences of each word
  val counts = text.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.collect().foreach(println)
}
This is what a WordCount application looks like when built with Spark.
Spark’s Text-Mining Tools
• LDA for Topic Extraction
• Word2Vec, an unsupervised way to turn words into features based on their meaning
• CountVectorizer turns documents into vectors based on word count
• HashingTF-IDF calculates important words of a document with respect to the corpus
• And much more
Spark's text-mining tools include LDA, CountVectorizer, HashingTF-IDF, and others.
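To make "important words of a document with respect to the corpus" concrete, here is a minimal plain-Scala sketch of TF-IDF that does not use Spark. The object and method names are hypothetical, and the smoothed formula log((N + 1) / (df + 1)) is an assumption of this sketch rather than something stated on the slides.

```scala
object TfIdfSketch {
  // Term frequency: how often each word occurs in one document
  def termFrequency(doc: Seq[String]): Map[String, Int] =
    doc.groupBy(identity).map { case (word, occurrences) => (word, occurrences.size) }

  // Smoothed inverse document frequency: log((N + 1) / (df + 1)),
  // where df is the number of documents that contain the word
  def inverseDocumentFrequency(corpus: Seq[Seq[String]]): Map[String, Double] = {
    val n = corpus.size
    val documentFrequency = corpus
      .flatMap(_.distinct)
      .groupBy(identity)
      .map { case (word, docs) => (word, docs.size) }
    documentFrequency.map { case (word, df) => (word, math.log((n + 1.0) / (df + 1.0))) }
  }

  // TF-IDF score per word of one document; unknown words score 0
  def tfIdf(doc: Seq[String], idf: Map[String, Double]): Map[String, Double] =
    termFrequency(doc).map { case (word, tf) => (word, tf * idf.getOrElse(word, 0.0)) }
}
```

A word that occurs in every document gets an IDF of 0 under this formula, so it contributes nothing to the score no matter how often it appears.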
How to use Spark LDA
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)
sample_lda_data.txt
However, the LDA input data does not look like text.
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
(´Д`)
This does not look like text
LDA Step 0: Get words
To run LDA, the words must first be extracted.
Word Segmentation
• Hard to actually get right.
• Simple in theory with English
• str.split(" ")
• But not enough for real data.
• (Take parens for example.)
• ["(Take", "parens", "for", "example.)"]
• Etc.
Extracting words in practice is hard; simply splitting on a delimiter does not work well.
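The parenthesis problem above can be reproduced in a few lines of plain Scala. This is only a sketch: the object name and the regular expression are assumptions chosen for illustration, and a regex like this still falls well short of a real English tokenizer.

```scala
object EnglishTokenizerSketch {
  // Naive approach from the slide: split on single spaces
  def naive(text: String): Seq[String] = text.split(" ").toList

  // Slightly better: keep only runs of letters, digits and apostrophes
  private val wordPattern = "[\\p{L}\\p{N}']+".r
  def regex(text: String): Seq[String] = wordPattern.findAllIn(text).toList
}
```

For the input `(Take parens for example.)`, `naive` keeps the punctuation attached to the first and last tokens, while `regex` returns the four bare words.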
Word Segmentation
• Since Japanese lacks spaces it’s hard even in theory
• A probabilistic approach is necessary
• Thankfully there are libraries that can help
Japanese has no word delimiters, so extracting words requires a probabilistic approach; libraries make this efficient.
Morphological Analyzers
• Include POS tagging, pronunciation and stemming
• MeCab
• Written in C++ with SWIG bindings to pretty much everything
• Kuromoji
• Written in Java, available via Maven
• Others
Libraries such as MeCab and Kuromoji are available for morphological analysis (POS tagging, pronunciation, stemming, and so on).
JMecab & Spark/Hadoop
• Not impossible but difficult
• Add MeCab to each node
• Add jar to classpaths
• Include jar in project for compilation
• Not too bad with own hardware but painful with Amazon EMR or Azure HDInsight
JMecab must be installed in advance, so it is manageable on-premises but hard to run in cloud environments.
Kuromoji & Spark/Hadoop
• Easy
• Include dependency in build.sbt
• Include jar file in FatJar with sbt-assembly
Kuromoji is easy to use: just add the dependency and build a FatJar.
Using Kuromoji
import org.atilika.kuromoji.Tokenizer

object Main extends App {
  import scala.collection.JavaConverters.asScalaBufferConverter

  val tokenizer = Tokenizer.builder().build()
  val ex1 = "リストのような構造の物から条件を満たす物を探す"
  val res1 = tokenizer.tokenize(ex1).asScala
  // Print each token's base form and part of speech, separated by a tab
  for (token <- res1) {
    println(s"${token.getBaseForm}\t${token.getPartOfSpeech}")
  }
}
Using Kuromoji
With Kuromoji, text is recognized like this:
厚生 名詞,一般,*,*
年金 名詞,一般,*,*
基金 名詞,一般,*,*
脱退 名詞,サ変接続,*,*
に 助詞,格助詞,一般,*
伴う 動詞,自立,*,*
手続き 名詞,サ変接続,*,*
について 助詞,格助詞,連語,*
の 助詞,連体化,*,*
リマ 名詞,固有名詞,地域,一般
インド 名詞,固有名詞,地域,国
です 助動詞,*,*,*
リスト 名詞,一般,*,*
の 助詞,連体化,*,*
よう 名詞,非自立,助動詞語幹,*
だ 助動詞,*,*,*
構造 名詞,一般,*,*
の 助詞,連体化,*,*
物 名詞,非自立,一般,*
から 助詞,格助詞,一般,*
条件 名詞,一般,*,*
を 助詞,格助詞,一般,*
満たす 動詞,自立,*,*
物 名詞,非自立,一般,*
を 助詞,格助詞,一般,*
探す 動詞,自立,*,*
Step 1: Build Vocabulary
Building the vocabulary
Vocabulary
lazy val tokenizer = Tokenizer.builder().build()
val text = sc.textFile("documents")
// Tokenize every line and keep each token's base form
val words = for {
  line <- text
  token <- tokenizer.tokenize(line).asScala
} yield token.getBaseForm
// Assign a unique index to each distinct word
val vocab = words.distinct().zipWithIndex().collectAsMap()
Step 2: Create Corpus
Creating the corpus
Corpus
// Tokenize each document into base-form words
val documentWords: RDD[Array[String]] =
  text.map(line => tokenizer.tokenize(line).asScala.map(t => t.getBaseForm).toArray)
// Count how often each distinct word occurs in its document
val documentCounts: RDD[Array[(String, Int)]] =
  documentWords.map(words => words.distinct.map { word =>
    (word, words.count(_ == word))
  })
// Replace each word with its vocabulary index
val documentIndexAndCount: RDD[Seq[(Int, Double)]] =
  documentCounts.map(wordsAndCount => wordsAndCount.map {
    case (word, count) => (vocab(word).toInt, count.toDouble)
  })
// Build one sparse count vector per document, keyed by document ID
val corpus: RDD[(Long, Vector)] =
  documentIndexAndCount.map(Vectors.sparse(vocab.size, _)).zipWithIndex.map(_.swap)
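The same vocabulary and count-vector steps can be tried without a cluster. The following plain-Scala sketch works on documents that are already tokenized; the object and method names are hypothetical and it is not the code from the talk. It also shows how a tokenized document becomes a row of numbers like those in sample_lda_data.txt, assuming each column of that file stands for one vocabulary word.

```scala
object CorpusSketch {
  // Assign each distinct word an index, in order of first appearance
  def buildVocab(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.distinct.zipWithIndex.toMap

  // Sparse count vector for one document: (wordIndex, count) pairs sorted by index
  def toSparseCounts(doc: Seq[String], vocab: Map[String, Int]): Seq[(Int, Double)] =
    doc.groupBy(identity).toSeq
      .map { case (word, occurrences) => (vocab(word), occurrences.size.toDouble) }
      .sortBy(_._1)

  // Dense row with one count per vocabulary word
  def toDenseRow(doc: Seq[String], vocab: Map[String, Int]): Seq[Double] = {
    val counts = toSparseCounts(doc, vocab).toMap
    (0 until vocab.size).map(i => counts.getOrElse(i, 0.0))
  }
}
```

Like the Spark version, `toSparseCounts` throws if a word is missing from the vocabulary, so words outside the vocabulary must be filtered out first.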
Step 3: Learn Topics
Training the topic model
Learn Topics
val ldaModel = new LDA().setK(10).setMaxIterations(100).run(corpus)
// Map term indices back to words; vocabulary is an array of words ordered by index
val topics = ldaModel.describeTopics(10).map {
  case (terms, weights) =>
    terms.map(vocabulary(_)).zip(weights)
}
topics.zipWithIndex.foreach {
  case (topic, i) =>
    println(s"TOPIC $i")
    topic.foreach { case (term, weight) => println(s"$term\t$weight") }
    println(s"==========")
}
Step 4: Evaluate
Evaluating the results
Topics?
Topic 0:
です 0.10870545899718176 。 0.09623411796419644 さん 0.06105040403724023
Topic 1:
の 0.11035671185240525 を 0.07860862808644907 する 0.05605566895190625
Topic 2:
お願い 0.07579177409154919 ご 0.04431117457391179 よろしく 0.032788330612439916
The results turned out to be particles and auxiliary words.
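Step 5: GOTO 2
Filter Stopwords
// Treat the 50 most frequent words as stopwords
val popular = words
  .map(w => (w, 1))
  .reduceByKey(_ + _)
  .sortBy(-_._2)
  .take(50)
  .map(_._1)
  .toSet
// Index only the words that are not stopwords
val vocabIndices = words.distinct().filter(w => !popular.contains(w)).zipWithIndex()
// collectAsMap returns a scala.collection.Map, so convert it to an immutable Map
val vocab: Map[String, Long] = vocabIndices.collectAsMap().toMap
val vocabulary = vocabIndices.collect().map(_._1)
Removing stopwords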
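The frequency-based stopword filter can also be sketched in plain Scala on tokenized documents. The names below are hypothetical, and the alphabetical tie-breaking is an assumption added only to make the result deterministic.

```scala
object StopwordSketch {
  // The n most frequent words across all documents, treated as stopwords
  def mostFrequent(docs: Seq[Seq[String]], n: Int): Set[String] =
    docs.flatten.groupBy(identity).toSeq
      .map { case (word, occurrences) => (word, occurrences.size) }
      .sortBy { case (word, count) => (-count, word) } // ties broken alphabetically
      .take(n)
      .map(_._1)
      .toSet

  // Drop every stopword from a tokenized document
  def removeStopwords(doc: Seq[String], stopwords: Set[String]): Seq[String] =
    doc.filterNot(stopwords.contains)
}
```

The cutoff (50 in the slide) is a tuning knob: too small leaves particles in the topics, too large starts removing words that carry meaning.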
Topics!
Topic 0:
変更 0.032952997236706624 サーバー 0.03140777729144046 設定 0.021643554361727567 エラー 0.017955380768330902
Topic 1:
ログ 0.028665774057609564 時間 0.026686704628121154 時 0.02404938565591628 発生 0.020797622509804107
Topic 2:
様 0.0474658820402456 株式会社 0.026174292703953685 お世話 0.021939329774535308
Using the LDA model
• Prediction requires a LocalLDAModel
• Use .toLocal if isInstanceOf[DistributedLDAModel]
• Convert to Vector using same steps
• Be sure to filter out words not in the vocabulary
• Call topicDistributions to see topic scores
The LDA model is used to predict topics.
Topics Prediction
New document topics:
0.091084004103132,0.1044111561202625,0.09090943947509807,0.11607354553753861,0.10404284803971378,0.09697071269561051,0.09571658794577831,0.0919546186785918,0.09176248930132802,0.11707459810294643
New document topics:
0.09424474530277152,0.1183270779577911,0.09230776874419214,0.09835759337114718,0.13159581881630272,0.09279638945611612,0.094124104743527,0.09295449996673977,0.09291472297512052,0.09237727866629193
Predicting topics
Topic 0 Topic 1 Topic 2 Topic …
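Now what?
• Find the minimum logLikelihood in a set of documents you know are OK
• Report anomaly whenever a new document has a lower logLikelihood
Compute the minimum log-likelihood over a set of documents whose topics were predicted correctly; if a new document falls below that value, classify it as an anomaly.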
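The thresholding rule itself is small enough to sketch in plain Scala. The object and method names are hypothetical; the scores would come from the model's logLikelihood, and any values passed in are up to the caller.

```scala
object AnomalySketch {
  // Threshold = lowest log-likelihood among documents known to be OK
  def threshold(knownGoodScores: Seq[Double]): Double = knownGoodScores.min

  // A new document is anomalous when it scores below the threshold
  def isAnomaly(score: Double, threshold: Double): Boolean = score < threshold
}
```

With made-up known-good scores of -120.5, -98.2 and -143.0, the threshold is -143.0, so a new document scoring -100.0 passes and one scoring -150.0 is reported.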
Anomaly Detection
val newDoc = sc.parallelize(Seq("平素は当社サービスをご利用いただき、誠にありがとうございます。"))
def stringToCountVector(strings: RDD[String]) = {
  . . .
}
val score = lda.logLikelihood(stringToCountVector(newDoc))
/*
-2153492.694125671
*/
Word2Vec
• Creates vectors that represent points in a meaning space
• Unsupervised but requires a lot of data to generate good vectors
• Google’s sample vectors trained on 100 billion words (~X00GB?)
• Vectors with less data can provide interesting similarities but can’t do so consistently
Word2Vec turns words into vectors, giving them a quantitative representation, so the similarity between words can be computed.
Word2Vec Intuition
• Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
An example of actual word vectorization
Vector Concatenation
ITEM_01
営業
活用
営業
の
情報
共有
と
サポート. . .
Step 1: Make vectors
Generating word vectors
Making Word2VecModel
val documentWords: RDD[Seq[String]] =
  text.map(line => tokenizer.tokenize(line).asScala.map(_.getSurfaceForm).toSeq)
documentWords.cache()
val model = new Word2Vec().setVectorSize(300).fit(documentWords)
Step 2: Use vectors
Applying word vectors
Using Word2VecModel
model.findSynonyms("日本", 5).foreach(println)
/*
(マイクロソフト,3.750299190465294)
(ビジネス,3.7329870992662104)
(株式会社,3.323983664186244)
(システムズ,3.1331352923187987)
(ビジネスプロダクティビティ,2.595931613590554)
*/
An example of computing word similarity; the results vary greatly with the source data, so the source data is very important.
Big dataset is very important.
Recommendation
• Paragraph Vectors
• Not available in Spark T_T
Recommendation by vectorizing sentences is not possible in Spark.
Embedding with Vector Concatenation
• Calculate the sum of the word vectors in the description
• Add it to vectors from Word2VecModel.getVectors with special keyword (Ex. ITEM_1234)
• Create new Word2VecModel using constructor
• ※Not state of the art but can produce reasonable recommendations without user rating data
Embedding via vector concatenation: sum the vectors of the words contained in each item.
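The idea can be tried on toy vectors without Spark or Breeze. In this plain-Scala sketch every name is hypothetical, unknown words are simply skipped (an assumption of this sketch), and items are ranked by cosine similarity, which may not match the scores that Spark's findSynonyms reports.

```scala
object ItemEmbeddingSketch {
  type Vec = Array[Double]

  // Element-wise sum of the vectors of all known words in a description
  def embed(words: Seq[String], wordVectors: Map[String, Vec], size: Int): Vec =
    words.flatMap(w => wordVectors.get(w)).foldLeft(Array.fill(size)(0.0)) { (acc, v) =>
      acc.zip(v).map { case (a, b) => a + b }
    }

  // Cosine similarity; 0.0 when either vector has zero length
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Rank stored items by similarity to a query vector, best first
  def mostSimilar(query: Vec, items: Map[String, Vec], n: Int): Seq[(String, Double)] =
    items.toSeq.map { case (id, v) => (id, cosine(query, v)) }.sortBy(-_._2).take(n)
}
```

Because item vectors and word vectors live in the same space, the same lookup answers both "which items resemble this item" and "which items resemble this new sentence".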
Item Embedding (1/2)
val embeds = Map(
  "ITEM_001_01" -> "営業部門の情報共有と活用をサポートし",
  "ITEM_001_02" -> "組織的な営業力・売れる仕組みを構築します",
  "ITEM_001_03" -> "営業情報のコミュニケーション基盤を構築する",
  "ITEM_002_01" -> "一般的なサーバ、ネットワーク機器やOSレベルの監視に加え",
  "ITEM_002_02" -> "またモニタリングポータルでは、アラームの発生状況",
  "ITEM_002_03" -> "監視システムにより取得されたパフォーマンス情報が逐次ダッシュボード形式",
  "ITEM_003_01" -> "IPネットワークインフラストラクチャを構築します",
  "ITEM_003_02" -> "導入にとどまらず、アプリケーションやOAシステムとの融合を図ったユニファイドコミュニケーション環境を構築",
  "ITEM_003_03" -> "企業内および企業外へのコンテンツの効果的な配信環境、閲覧環境をご提供します"
)
Item Embedding (2/2)
import scala.util.Try
import breeze.linalg.DenseVector

// vectorLength must equal the Word2Vec vector size (300 above)
def stringToVector(s: String): Array[Double] = {
  val words = tokenizer.tokenize(s).asScala.map(_.getSurfaceForm).toSeq
  // Fall back to the vector for "は" when a word is not in the model
  val vectors = words.map(word =>
    Try(model.transform(word)).getOrElse(model.transform("は")))
  val breezeVectors: Seq[DenseVector[Double]] = vectors.map(v => new DenseVector(v.toArray))
  // Sum the word vectors element-wise
  val concat = breezeVectors.foldLeft(DenseVector.zeros[Double](vectorLength))((a, b) => a + b)
  concat.toArray
}
val embedVectors: Map[String, Array[Float]] = embeds.map {
  case (key, value) => (key, stringToVector(value).map(_.toFloat))
}
// Combine the item vectors with the word vectors in a new model
val embedModel = new Word2VecModel(embedVectors ++ model.getVectors)
Recommending Similar
embedModel.findSynonyms("ITEM_001_01", 5).foreach(println)
/*
(ITEM_001_03,12.577457221575695)
(ITEM_003_03,12.542920930725996)
(ITEM_003_02,12.315240961298104)
(ITEM_001_02,12.260734177166485)
(ITEM_002_01,10.866897938028856)
*/
Computing similarity
Recommending New
val newSentence = stringToVector("会計・受発注及び生産管理を中心としたシステム")
embedModel.findSynonyms(Vectors.dense(newSentence), 5).foreach(println)
/*
(ITEM_001_02,14.372981084681571)
(ITEM_003_03,14.343473534848325)
(ITEM_001_01,13.83593570884867)
(ITEM_002_01,13.61507040314043)
(ITEM_002_03,13.462141195072414)
*/
Recommending from a new sample
Thank you
• Questions?
• Example source code at:
• https://github.com/wmeddie/spark-text

alexjohnson7307
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
GDSC PJATK
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 

Recently uploaded (20)

Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 

Scala Matsuri 2016: Japanese Text Mining with Scala and Spark

  • 1. Japanese Text Mining with Scala and Spark Eduardo Gonzalez Scala Matsuri 2016
  • 2. About Me • Eduardo Gonzalez • Japan Business Systems • Japanese System Integrator (SIer) • Social Systems Design Center (R&D) • Pittsburgh University • Computer Science • Japanese @wm_eddie
  • 3. Agenda • Intro to Text mining with Spark • Pre-processing Japanese Text • Japanese Word Breaking • Spark Gotchas • Topic Extraction with LDA • Intro to Word2Vec • Recommendation with Word Embedding
  • 4. Machine Learning Vocabulary • Feature: A number that represents something about a data point • Label: A feature of the data we want to predict • Document: A block of text with a unique ID • Model: A learned set of parameters that can be used for prediction • Corpus: A collection of documents. Feature, Label, Document, Model, and Corpus are the prerequisite vocabulary for machine learning.
  • 5. What is Apache Spark • A library that defines a Resilient Distributed Dataset type and a set of transformations • RDDs are only representations of calculations • A runtime that can execute RDDs in a distributed manner • A master process that schedules and monitors executors • Executors actually do the calculations and can keep results in their memory • Spark SQL, MLlib and GraphX define special types of RDDs. Spark is a general-purpose distributed processing platform with SQL, machine learning, and graph components.
  • 6. Apache Spark Example
        import org.apache.spark.{SparkConf, SparkContext}

        object Main extends App {
          val sc = new SparkContext(new SparkConf())
          val text = sc.textFile("hdfs:///kjb.txt")
          val counts = text.flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
          counts.collect().foreach(println)
        }
    A WordCount application built with Spark looks like this.
  • 7. Spark’s Text-Mining Tools • LDA for Topic Extraction • Word2Vec, an unsupervised way to turn words into features based on their meaning • CountVectorizer turns documents into vectors based on word count • HashingTF-IDF calculates important words of a document with respect to the corpus • And much more. Spark's text-mining tools include LDA, CountVectorizer, and HashingTF-IDF.
  • 8. How to use Spark LDA
        import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
        import org.apache.spark.mllib.linalg.Vectors

        // Load and parse the data
        val data = sc.textFile("data/mllib/sample_lda_data.txt")
        val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
        // Index documents with unique IDs
        val corpus = parsedData.zipWithIndex.map(_.swap).cache()
        // Cluster the documents into three topics using LDA
        val ldaModel = new LDA().setK(3).run(corpus)
  • 9. sample_lda_data.txt
    However, the LDA input data does not look like text:
        1 2 6 0 2 3 1 1 0 0 3
        1 3 0 1 3 0 0 2 0 0 1
        1 4 1 0 0 4 9 0 1 2 0
        2 1 0 3 0 0 5 0 2 3 9
        3 1 1 9 3 0 2 0 0 1 3
        4 2 0 3 4 5 1 1 1 4 0
        2 1 0 3 0 0 5 0 2 2 9
        1 1 1 9 2 1 2 0 0 1 3
        4 4 0 3 4 2 1 3 0 0 0
        2 8 2 0 3 0 2 0 2 7 2
        1 1 1 9 0 2 2 0 0 3 3
        4 1 0 0 4 5 1 3 0 1 0
    (´Д`) This does not look like text
  • 10. LDA Step 0: Get words. Before running LDA, we first need to extract the words.
  • 11. Word Segmentation • Hard to actually get right. • Simple in theory with English • str.split(" ") • But not enough for real data. • (Take parens for example.) • ["(Take", "parens", "for", "example.)"] • Etc. Real-world word extraction is hard; simply splitting on a delimiter does not work well.
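The parenthesis problem on the slide above can be reproduced in a few lines of plain Scala. This is only a sketch: the object name `SplitDemo` and the regular expression are illustrative assumptions, not taken from the talk, and a regex like this is still far from a real English tokenizer.

```scala
object SplitDemo {
  // Naive whitespace split, as on the slide: punctuation stays glued to the words.
  def naiveSplit(s: String): List[String] = s.split(" ").toList

  // Assumed alternative: keep only runs of letters, digits, and apostrophes.
  private val Word = "[\\p{L}\\p{N}']+".r

  def wordTokens(s: String): List[String] = Word.findAllIn(s).toList
}
```

Calling `SplitDemo.naiveSplit("(Take parens for example.)")` leaves the parentheses and the period attached to the first and last tokens, while `SplitDemo.wordTokens` drops them. Neither helps with Japanese, which is why the following slides turn to a morphological analyzer.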
  • 12. Word Segmentation • Since Japanese lacks spaces it’s hard even in theory • A probabilistic approach is necessary • Thankfully there are libraries that can help. Japanese has no word delimiter, so extracting words needs a probabilistic approach; libraries make this practical.
  • 13. Morphological Analyzers • Include POS tagging, pronunciation and stemming • MeCab • Written in C++ with SWIG bindings to pretty much everything • Kuromoji • Written in Java, available via Maven • Others. Libraries such as MeCab and Kuromoji are available for morphological analysis (POS tagging, pronunciation, stemming, and so on).
  • 14. JMecab & Spark/Hadoop • Not impossible but difficult • Add MeCab to each node • Add jar to classpaths • Include jar in project for compilation • Not too bad with own hardware but painful with Amazon EMR or Azure HDInsight. JMecab has to be installed in advance, so it is manageable on-premises but hard to run in cloud environments.
  • 15. Kuromoji & Spark/Hadoop • Easy • Include dependency in build.sbt • Include jar file in FatJar with sbt-assembly. Kuromoji is easy to use: just add the dependency and build a FatJar.
  • 16. Using Kuromoji
        import org.atilika.kuromoji.Tokenizer

        object Main extends App {
          import scala.collection.JavaConverters.asScalaBufferConverter

          val tokenizer = Tokenizer.builder().build()
          val ex1 = "リストのような構造の物から条件を満たす物を探す"
          val res1 = tokenizer.tokenize(ex1).asScala
          // Print each token's base form and part of speech, tab-separated
          for (token <- res1) {
            println(s"${token.getBaseForm}\t${token.getPartOfSpeech}")
          }
        }
  • 17. Using Kuromoji
    This is how Kuromoji analyzes the text (base form, then part of speech):
        厚生	名詞,一般,*,*
        年金	名詞,一般,*,*
        基金	名詞,一般,*,*
        脱退	名詞,サ変接続,*,*
        に	助詞,格助詞,一般,*
        伴う	動詞,自立,*,*
        手続き	名詞,サ変接続,*,*
        について	助詞,格助詞,連語,*
        の	助詞,連体化,*,*
        リマ	名詞,固有名詞,地域,一般
        インド	名詞,固有名詞,地域,国
        です	助動詞,*,*,*
        リスト	名詞,一般,*,*
        の	助詞,連体化,*,*
        よう	名詞,非自立,助動詞語幹,*
        だ	助動詞,*,*,*
        構造	名詞,一般,*,*
        の	助詞,連体化,*,*
        物	名詞,非自立,一般,*
        から	助詞,格助詞,一般,*
        条件	名詞,一般,*,*
        を	助詞,格助詞,一般,*
        満たす	動詞,自立,*,*
        物	名詞,非自立,一般,*
        を	助詞,格助詞,一般,*
        探す	動詞,自立,*,*
  • 19. Vocabulary
        lazy val tokenizer = Tokenizer.builder().build()
        val text = sc.textFile("documents")
        val words = for {
          line <- text
          token <- tokenizer.tokenize(line).asScala
        } yield token.getBaseForm
        val vocab = words.distinct().zipWithIndex().collectAsMap()
  • 20. Step 2: Create Corpus
  • 21. Corpus
        val documentWords: RDD[Array[String]] =
          text.map(line => tokenizer.tokenize(line).asScala.map(t => t.getBaseForm).toArray)
        val documentCounts: RDD[Array[(String, Int)]] =
          documentWords.map(words => words.distinct.map { word =>
            (word, words.count(_ == word))
          })
        val documentIndexAndCount: RDD[Seq[(Int, Double)]] =
          documentCounts.map(wordsAndCount => wordsAndCount.map {
            case (word, count) => (vocab(word).toInt, count.toDouble)
          })
        val corpus: RDD[(Long, Vector)] =
          documentIndexAndCount.map(Vectors.sparse(vocab.size, _)).zipWithIndex.map(_.swap)
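To see why the rows in sample_lda_data.txt do not look like text, it may help to run the same vocabulary and counting steps on a toy, already-tokenized corpus without Spark. The sketch below uses plain Scala collections in place of RDDs; `CountVectors`, `buildVocab`, `toSparse`, and `toDense` are hypothetical names, and sorting the vocabulary is an assumption made only so the indices are reproducible.

```scala
object CountVectors {
  // Give every distinct word an index (sorted so the mapping is reproducible).
  def buildVocab(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.distinct.sorted.zipWithIndex.toMap

  // Sparse (index, count) pairs for one document, ordered by index.
  // Words missing from the vocabulary are dropped.
  def toSparse(doc: Seq[String], vocab: Map[String, Int]): Seq[(Int, Double)] =
    doc.filter(vocab.contains)
      .groupBy(identity)
      .map { case (word, occurrences) => (vocab(word), occurrences.size.toDouble) }
      .toSeq
      .sortBy(_._1)

  // Dense row with one count per vocabulary entry, like a line of sample_lda_data.txt.
  def toDense(doc: Seq[String], vocab: Map[String, Int]): Array[Double] = {
    val row = Array.fill(vocab.size)(0.0)
    toSparse(doc, vocab).foreach { case (index, count) => row(index) = count }
    row
  }
}
```

Each document becomes one row of counts, with one column per vocabulary word. The sparse form is the shape of the `(index, count)` pairs that the slide passes to `Vectors.sparse`.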
  • 22. Step 3: Learn Topics (training the topic model)
  • 23. Learn Topics
        val ldaModel = new LDA().setK(10).setMaxIterations(100).run(corpus)
        // Map term indices back to words and pair each with its weight
        val topics = ldaModel.describeTopics(10).map {
          case (terms, weights) => terms.map(vocabulary(_)).zip(weights)
        }
        topics.zipWithIndex.foreach {
          case (topic, i) =>
            println(s"TOPIC $i")
            topic.foreach { case (term, weight) => println(s"$term\t$weight") }
            println(s"==========")
        }
  • 25. Topics?
        Topic 0:
          です	0.10870545899718176
          。	0.09623411796419644
          さん	0.06105040403724023
        Topic 1:
          の	0.11035671185240525
          を	0.07860862808644907
          する	0.05605566895190625
        Topic 2:
          お願い	0.07579177409154919
          ご	0.04431117457391179
          よろしく	0.032788330612439916
    The top terms turned out to be particles and other auxiliary words.
  • 27. Filter Stopwords
        // Treat the 50 most frequent words as stop words
        val popular = words
          .map(w => (w, 1))
          .reduceByKey(_ + _)
          .sortBy(-_._2)
          .take(50)
          .map(_._1)
          .toSet
        val vocabIndicies = words.distinct().filter(w => !popular.contains(w)).zipWithIndex()
        val vocab: Map[String, Long] = vocabIndicies.collectAsMap()
        val vocabulary = vocabIndicies.collect().map(_._1)
    Removing stop words.
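The frequency-based stop-word filter above can be tried on a small in-memory word list. This is a minimal sketch with plain collections instead of RDDs; `StopWords`, `mostFrequent`, and `removeStopWords` are hypothetical names, and the alphabetical tie-break is an assumption added only to keep the result deterministic.

```scala
object StopWords {
  // The n most frequent words; ties are broken alphabetically.
  def mostFrequent(words: Seq[String], n: Int): Set[String] =
    words.groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
      .sortBy { case (word, count) => (-count, word) }
      .take(n)
      .map(_._1)
      .toSet

  // Drop every occurrence of a stop word, keeping the original order.
  def removeStopWords(words: Seq[String], stop: Set[String]): Seq[String] =
    words.filterNot(stop.contains)
}
```

The cutoff (50 on the slide) is a tuning choice. Too small and particles still dominate the topics; too large and frequent content words may be discarded with them.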
  • 28. Topics!
        Topic 0:
          変更	0.032952997236706624
          サーバー	0.03140777729144046
          設定	0.021643554361727567
          エラー	0.017955380768330902
        Topic 1:
          ログ	0.028665774057609564
          時間	0.026686704628121154
          時	0.02404938565591628
          発生	0.020797622509804107
        Topic 2:
          様	0.0474658820402456
          株式会社	0.026174292703953685
          お世話	0.021939329774535308
  • 29. Using the LDA model • Prediction requires a LocalLDAModel • Use .toLocal if isInstanceOf[DistributedLDAModel] • Convert to Vector using same steps • Be sure to filter out words not in the vocabulary • Call topicDistributions to see topic scores. The LDA model is used to predict topics.
  • 30. Topics Prediction
    Topic prediction (each value is the score for Topic 0, Topic 1, Topic 2, Topic …):
        New document topics:
        0.091084004103132,0.1044111561202625,0.09090943947509807,0.11607354553753861,0.10404284803971378,0.09697071269561051,0.09571658794577831,0.0919546186785918,0.09176248930132802,0.11707459810294643
        New document topics:
        0.09424474530277152,0.1183270779577911,0.09230776874419214,0.09835759337114718,0.13159581881630272,0.09279638945611612,0.094124104743527,0.09295449996673977,0.09291472297512052,0.09237727866629193
  • 31. Now what? • Find the minimum logLikelihood in a set of documents you know are OK • Report anomaly whenever a new document has a lower logLikelihood. In other words, compute the minimum log-likelihood over the known-good set and classify a new document as an anomaly when its value falls below it.
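The thresholding rule itself is independent of Spark and could be sketched as below. The scores are assumed to come from something like the model's logLikelihood call on one document at a time; `AnomalyThreshold`, `threshold`, and `isAnomaly` are hypothetical names.

```scala
object AnomalyThreshold {
  // Threshold = the lowest log-likelihood seen among documents known to be OK.
  def threshold(knownOkScores: Seq[Double]): Double = {
    require(knownOkScores.nonEmpty, "need at least one known-good score")
    knownOkScores.min
  }

  // A new document is reported as an anomaly when it scores below the threshold.
  def isAnomaly(score: Double, threshold: Double): Boolean = score < threshold
}
```

One caveat that is not from the slides: a log-likelihood is a sum over tokens, so longer documents tend to score lower. Dividing each score by the document's token count before comparing may be worth considering.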
  • 32. Anomaly Detection
        val newDoc = sc.parallelize(Seq("平素は当社サービスをご利用いただき、誠にありがとうございます。 "))
        def stringToCountVector(strings: RDD[String]) = { . . . }
        val score = lda.logLikelihood(stringToCountVector(newDoc))
        /* -2153492.694125671 */
  • 33. Word2Vec • Creates vectors that represent points in meaning space • Unsupervised but requires a lot of data to generate good vectors • Google’s sample vectors trained on 100 billion words (~X00GB?) • Vectors with less data can provide interesting similarities but can’t do so consistently. Word2Vec turns words into vectors, giving them a quantitative representation from which similarity between words can be computed.
  • 34. Word2Vec Intuition • Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013. An actual example of word vectorization.
  • 36. Step 1: Make vectors (generating the word vectors)
  • 37. Making Word2VecModel
        val documentWords: RDD[Seq[String]] =
          text.map(line => tokenizer.tokenize(line).asScala.map(_.getSurfaceForm).toSeq)
        documentWords.cache()
        val model = new Word2Vec().setVectorSize(300).fit(documentWords)
  • 38. Step 2: Use vectors (applying the word vectors)
  • 40. Recommendation • Paragraph Vectors • Not available in Spark T_T. Recommendation based on vectorized text (paragraph vectors) cannot be done in Spark.
  • 41. Embedding with Vector Concatenation • Calculate sum of words in description • Add it to vectors from Word2VecModel.getVectors with special keyword (Ex. ITEM_1234) • Create new Word2VecModel using constructor • ※Not state of the art but can produce reasonable recommendations without user rating data. The embedding sums, for each item, the vectors of the words its description contains.
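The idea on this slide, summing the word vectors of a description and then looking up the nearest items, can be illustrated with plain arrays. This is a sketch under stated assumptions, not the talk's implementation: the names are hypothetical, the ranking uses cosine similarity, and unknown words are simply skipped, whereas the slide code that follows substitutes the vector for "は".

```scala
object ItemEmbedding {
  type Vec = Array[Double]

  // Element-wise sum of vectors of length dim.
  def sum(vectors: Seq[Vec], dim: Int): Vec =
    vectors.foldLeft(Array.fill(dim)(0.0)) { (acc, v) =>
      acc.zip(v).map { case (a, b) => a + b }
    }

  // Cosine similarity; defined as 0.0 when either vector has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Embed a tokenized description as the sum of its known word vectors.
  def embed(words: Seq[String], wordVectors: Map[String, Vec], dim: Int): Vec =
    sum(words.flatMap(wordVectors.get), dim)

  // Rank items by similarity to the query, best first.
  def recommend(query: Vec, items: Map[String, Vec], n: Int): Seq[(String, Double)] =
    items.toSeq
      .map { case (id, v) => (id, cosine(query, v)) }
      .sortBy(-_._2)
      .take(n)
}
```

In this sketch each item is embedded once from its description with `embed`, and a new sentence is embedded the same way and passed to `recommend`. No user rating data is needed, which is the advantage the slide points out.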
  • 42. Item Embedding (1/2)
        val embeds = Map(
          "ITEM_001_01" -> "営業部門の情報共有と活用をサポートし",
          "ITEM_001_02" -> "組織的な営業力・売れる仕組みを構築します",
          "ITEM_001_03" -> "営業情報のコミュニケーション基盤を構築する",
          "ITEM_002_01" -> "一般的なサーバ、ネットワーク機器やOSレベルの監視に加え",
          "ITEM_002_02" -> "またモニタリングポータルでは、アラームの発生状況",
          "ITEM_002_03" -> "監視システムにより取得されたパフォーマンス情報が逐次ダッシュボード形式",
          "ITEM_003_01" -> "IPネットワークインフラストラクチャを構築します",
          "ITEM_003_02" -> "導入にとどまらず、アプリケーションやOAシステムとの融合を図ったユニファイドコミュニケーション環境を構築",
          "ITEM_003_03" -> "企業内および企業外へのコンテンツの効果的な配信環境、閲覧環境をご提供します"
        )
  • 43. Item Embedding (2/2)
        def stringToVector(s: String): Array[Double] = {
          val words = tokenizer.tokenize(s).asScala.map(_.getSurfaceForm).toSeq
          val vectors = words.map(word =>
            Try(model.transform(word)).getOrElse(model.transform("は")))
          val breezeVectors: Seq[DenseVector[Double]] =
            vectors.map(v => new DenseVector(v.toArray))
          val concat = breezeVectors.foldLeft(DenseVector.zeros[Double](vectorLength))((a, b) => a :+ b)
          concat.toArray
        }
        val embedVectors: Map[String, Array[Float]] = embeds.map {
          case (key, value) => (key, stringToVector(value).map(_.toFloat))
        }
        val embedModel = new Word2VecModel(embedVectors ++ model.getVectors)
  • 45. Recommending New
        val newSentence = stringToVector("会計・受発注及び生産管理を中心としたシステム")
        embedModel.findSynonyms(Vectors.dense(newSentence), 5).foreach(println)
        /*
        (ITEM_001_02,14.372981084681571)
        (ITEM_003_03,14.343473534848325)
        (ITEM_001_01,13.83593570884867)
        (ITEM_002_01,13.61507040314043)
        (ITEM_002_03,13.462141195072414)
        */
    Recommendations from a new sample.
  • 46. Thank you • Questions? • Example source code at: • https://github.com/wmeddie/spark-text