1. Personal experience running the Spark MLlib Word2Vec algorithm
2. Use Ansible and the Linode API to create/destroy Linode instances rapidly
3. Use Akka and d3.js to build the web backend server
4. React.js, t-SNE, and jieba are also used in this work.
2. Similarity
1. If two words have high similarity, they have a strong
relationship
2. Use Wikipedia to give the machine a general sense of our
world
"魯夫" (Luffy) is the main character of "海賊王" (One Piece)
"東京" (Tokyo) is the capital city of "日本" (Japan)
3. Related Applications
1. Voice-driven assistants
(Siri, Google Now, Microsoft Cortana)
2. E-commerce recommendation
(Alibaba, Rakuten)
3. Question answering (IBM Watson)
4. Others (Flipboard, SmartNews)
9. Preprocessing
1. Use OpenCC to convert Simplified Chinese to
Traditional Chinese
2. Supports C, C++, Python, PHP, Java, Ruby, and Node.js
3. Compatible with Linux, Windows, and Mac
4. "智能手机" -> "智慧手機" (smartphone), "信息" -> "資訊" (information)
5. You can try it out on the website http://opencc.byvoid.com/
opencc -i zhwiki.txt -o twwiki.txt -c /usr/share/opencc/s2twp.json
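OpenCC can also be called from Python. A minimal sketch, assuming the opencc pip package is installed (config naming may vary across package versions):
#encoding=UTF-8
from opencc import OpenCC

# same s2twp config as the command-line example above;
# some package versions expect 's2twp' without the .json suffix
cc = OpenCC('s2twp.json')
print(cc.convert(u'智能手机'))  # expected: 智慧手機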
11. Preprocessing
#encoding=UTF-8
import sys
from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    inp, outp = sys.argv[1:3]
    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(' '.join(text) + "\n")
    output.close()
gensim provides an iterator to extract plain-text sentences
from the compressed wiki dump
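Assuming the script above is saved as wiki_to_txt.py (a hypothetical name), it turns the dump into the zhwiki.txt used in the OpenCC step:
python wiki_to_txt.py zhwiki-latest-pages-articles.xml.bz2 zhwiki.txt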
12. Segmentation
1. English uses notation (whitespace, periods, etc.) to
separate words,
but not all languages follow this practice
2. "下雨天/留客天/留我/不留" vs. "下雨/天留客/天留/我不留": the same
characters segment into readings with opposite meanings
(a quick jieba check follows this list)
3. New words keep being coined (such as "小確幸", "物聯網")
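A quick check, running jieba on the ambiguous sentence above; which reading it commits to depends on the dictionary:
#encoding=UTF-8
import jieba

if __name__ == '__main__':
    # the classic ambiguous sentence; jieba picks one segmentation
    print('/'.join(jieba.cut(u'下雨天留客天留我不留')))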
14. Segmentation
Sometimes the result is a little bit funny
#encoding=UTF-8
import jieba

if __name__ == '__main__':
    input_str = u'張無忌來大都找我吧!哈哈哈哈哈哈'
    seg_list = jieba.cut(input_str, cut_all=False)
    print(', '.join(seg_list))
張無忌, 來, 大都, 找, 我, 吧, !, 哈哈哈, 哈哈哈
15. Segmentation
A good dictionary gives a good result
#encoding=UTF-8
import jieba

if __name__ == '__main__':
    input_str = u'舒潔衛生紙買一送一'
    seg_list = jieba.cut(input_str, cut_all=False)
    print(', '.join(seg_list))
    jieba.set_dictionary('./data/dict.txt.big')
    seg_list = jieba.cut(input_str, cut_all=False)
    print(', '.join(seg_list))
舒潔衛, 生紙, 買, 一送, 一
舒潔, 衛生紙, 買一送一
16. Segmentation
Verb? Noun? Adjective? Adverb?
#encoding=UTF-8
import jieba.posseg as pseg

if __name__ == '__main__':
    input_str = u'今天讓我們來測試中文斷詞'
    seg_list = pseg.cut(input_str)
    for seg, flag in seg_list:
        print(u'{}:{}'.format(seg, flag))
今天:t 讓:v 我們:r 來:v 測試:vn 中文:nz 斷詞:n
19. Word2Vec
Transforms each word into a vector; the distance between
vectors implies the degree of similarity
‖vector("首爾") - vector("日本")‖ > ‖vector("東京") - vector("日本")‖
(Seoul is farther from Japan than Tokyo is)
vector("日本") - vector("東京") + vector("首爾") ≈ vector("南韓")
(Japan - Tokyo + Seoul ≈ South Korea)
20. Word2Vec
In Word2Vec, each word is asked to predict its
surrounding context
在日本,[ 青森 的 "蘋果" 又 甜 ]又好吃
(In Japan, [ Aomori's "apples" are sweet ] and tasty)
今年,新版的[ Macbook 是 "蘋果" 發表 的 ]重點之一
(This year, the new [ Macbook is one of the highlights "Apple" announced ])
"青森" and "Macbook" have high similarity with "蘋果"
Trained on the first sentence's windows, "青森" and "日本" also have
high similarity (a sketch of window extraction follows)
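To make "surrounding context" concrete, here is a minimal sketch of how (center, context) training pairs are drawn from a tokenized sentence with a window of 2 (tokens from the first example above):
#encoding=UTF-8

def context_pairs(tokens, window=2):
    # for each center word, yield every word within `window` positions
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

if __name__ == '__main__':
    tokens = [u'青森', u'的', u'蘋果', u'又', u'甜']
    for center, context in context_pairs(tokens):
        print(u'{} -> {}'.format(center, context))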
22. Training Word2Vec model by gensim
Words are already preprocessed and separated by whitespace.
#encoding=UTF-8
import sys
import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    inp = sys.argv[1]
    model = Word2Vec(LineSentence(inp),
                     size=100,
                     window=10,
                     min_count=10,
                     workers=multiprocessing.cpu_count())
It didn't work for me: gensim's Word2Vec ran out of memory.
23. Move to Spark MLlib
1. Spark offers over 80 operators that make it easy to build
parallel applications
2. Databricks used Spark to break the world record in the
2014 Daytona GraySort benchmark competition
3. MLlib is Spark's machine learning library.
24. Spark cluster overview
1. Spark has a master-slave architecture, like YARN
2. The cluster manager is the master; it handles resource
management and slave health monitoring
3. When you launch an application,
the master assigns a slave to be the driver.
The driver requests resources from the master,
executes the main function, and assigns tasks to the slaves
(see the spark-submit example below)
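For example, an application is typically submitted to the master with spark-submit; the host name, class, and jar below are placeholders:
spark-submit --master spark://master-host:7077 --class Word2VecTrainer word2vec-trainer.jar zhwiki.txt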
25. Spark cluster deployment
1. Use the Linode API to create and boot new instances rapidly
2. Use a standalone Spark cluster;
it can also be deployed on a Mesos or YARN cluster
3. Install Java and Scala, put the pre-built Spark package in
place, and finally launch the slave executors!
(see the commands after this list)
4. Use Ansible to deploy the Spark executors, and use LZ4 to speed
up decompressing the pre-built Spark package
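With the pre-built package unpacked on each machine, the standalone cluster is brought up with Spark's own scripts (the master host name is a placeholder; exact script arguments depend on the Spark version):
./sbin/start-master.sh
./sbin/start-slave.sh spark://master-host:7077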
26. Training Word2Vec model by Spark cluster
An RDD is the basic abstraction in Spark.
It represents an immutable, partitioned collection of elements
that can be operated on in parallel.
// read articles from HDFS into 5 partitions and cache them
val input: RDD[String] = sc.textFile(inp, 5).cache()
// tokenize() is the user-defined segmentation function (e.g. wrapping jieba)
val token: RDD[Seq[String]] = input.map(article => tokenize(article))
val word2vec = new Word2Vec()
word2vec.setNumPartitions(5)
val model = word2vec.fit(token)
// persist the trained model as a single object file on HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs://....")
27. Querying Word2Vec model by Spark cluster
val model = sc.objectFile[Word2VecModel]("hdfs://....").first()
val synonyms = model.findSynonyms("熱火", 10)
for ((synonym, cosineSim) <- synonyms) {
  println(synonym + ":" + cosineSim)
}
Load the model from HDFS.
Compared with model training, finding similar words is cheap
in resource requirements.
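For reference, the same train-and-query flow also works from PySpark's MLlib API; a minimal sketch (the HDFS path keeps the elided form used above):
#encoding=UTF-8
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

if __name__ == '__main__':
    sc = SparkContext(appName='word2vec')
    # one whitespace-separated article per line, as produced by preprocessing
    token = sc.textFile('hdfs://....').map(lambda line: line.split(' '))
    model = Word2Vec().setVectorSize(100).setNumPartitions(5).fit(token)
    for word, cosine_sim in model.findSynonyms(u'熱火', 10):
        print(u'{}:{}'.format(word, cosine_sim))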