Running Word2Vec with Chinese Wikipedia dump
Similarity
1. If two words have high similarity, they are strongly related.
2. Use Wikipedia to give the machine a general sense of our world:
"魯夫" (Luffy) is the main character in "海賊王" (One Piece)
"東京" (Tokyo) is the capital city of "日本" (Japan)
Related Application
1. voice-driven assistants
(Siri, Google Now, Microsoft Cortana)
2. e-commerce recommendation
(Alibaba, Rakuten)
3. question answering (IBM Watson)
4. others(Flipboard, SmartNews)
Build your own smart AI
My current progress
Download Wikipedia
1. https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
2. It contains both Traditional Chinese and Simplified Chinese articles
3. 1 GB file size, 230,000 articles, 150,000,000 words
Preprocessing
1. Use OpenCC to convert Simplified Chinese to Traditional Chinese
2. Supports C, C++, Python, PHP, Java, Ruby, and Node.js
3. Compatible with Linux, Windows, and Mac
4. Converts regional phrases, not only characters: “智能手机” -> “智慧手機” (smartphone), “信息” -> “資訊” (information)
5. You can try it on the website http://opencc.byvoid.com/
opencc -i zhwiki.txt -o twwiki.txt -c /usr/share/opencc/s2twp.json
Preprocessing
1. Use gensim to extract articles from the Wikipedia dump
2. About 2 GB of memory is required
Preprocessing
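import sys
from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    inp, outp = sys.argv[1:3]  # input dump path, output text path
    # dictionary={} skips building a vocabulary; lemmatize=False is for older
    # gensim releases (gensim 4.x dropped the argument, so omit it there)
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    with open(outp, 'w', encoding='utf-8') as output:
        for text in wiki.get_texts():  # one list of tokens per article
            output.write(' '.join(text) + '\n')
gensim provides an iterator that extracts the text of each article from the
compressed wiki dump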
Segmentation
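1. English uses delimiters (whitespace, periods, etc.) to separate words,
but not all languages follow this practice
2. The same sentence can be segmented in more than one way, with different meanings:
"下雨天/留客天/留我/不留", "下雨/天留客/天留/我不留"
3. New words keep being coined (such as "小確幸", "a small but certain happiness", and "物聯網", "Internet of Things")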
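The ambiguity in item 2 can be reproduced in a few lines of Python. The sketch below is an illustration only: it enumerates every way to split the sentence into words taken from a small hand-picked vocabulary (the eight words from the slide, chosen for this example). It is not how any particular segmenter works.

```python
def all_segmentations(text, vocabulary):
    """Return every way to split text into words that all appear in vocabulary."""
    if not text:
        return [[]]  # one valid segmentation of the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocabulary:
            # Keep this word and segment the remainder recursively.
            for rest in all_segmentations(text[end:], vocabulary):
                results.append([head] + rest)
    return results


# Hypothetical vocabulary: just the words that appear in the two readings.
VOCAB = {"下雨", "下雨天", "天留", "天留客", "留客天", "留我", "不留", "我不留"}

segmentations = all_segmentations("下雨天留客天留我不留", VOCAB)
for seg in segmentations:
    print("/".join(seg))
```

Both readings are valid with respect to the vocabulary, so a dictionary alone cannot decide between them. A real segmenter needs word frequencies or context to pick one.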
Segmentation
Jieba supports full mode and accurate (default) mode; a separate search-engine mode is available through jieba.cut_for_search
#encoding=UTF-8
import jieba

if __name__ == '__main__':
    # "Today let us test Chinese word segmentation"
    input_str = u'今天讓我們來測試中文斷詞'
    seg_list = jieba.cut(input_str, cut_all=True)   # full mode
    print(', '.join(seg_list))
    seg_list = jieba.cut(input_str, cut_all=False)  # accurate mode (default)
    print(', '.join(seg_list))
Output:
今天, 讓, 我, 們, 來, 測, 試, 中文, 斷, 詞
今天, 讓, 我們, 來, 測試, 中文, 斷詞
Segmentation
Sometimes the result is a little bit funny
#encoding=UTF-8
import jieba

if __name__ == '__main__':
    # "Zhang Wuji, come to Dadu and find me! Hahahahahaha"
    input_str = u'張無忌來大都找我吧!哈哈哈哈哈哈'
    seg_list = jieba.cut(input_str, cut_all=False)
    print(', '.join(seg_list))
Output:
張無忌, 來, 大都, 找, 我, 吧, !, 哈哈哈, 哈哈哈
Segmentation
Good dictionary, good result
#encoding=UTF-8
import jieba

if __name__ == '__main__':
    # "Kleenex tissue paper, buy one get one free"
    input_str = u'舒潔衛生紙買一送一'
    seg_list = jieba.cut(input_str, cut_all=False)  # default dictionary
    print(', '.join(seg_list))
    jieba.set_dictionary('./data/dict.txt.big')     # switch to the larger dictionary
    seg_list = jieba.cut(input_str, cut_all=False)
    print(', '.join(seg_list))
Output:
舒潔衛, 生紙, 買, 一送, 一
舒潔, 衛生紙, 買一送一
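To see why the dictionary matters, here is a minimal sketch of forward maximum matching, a classic dictionary-based technique. It is not jieba's algorithm (jieba scores a word graph by word frequency and uses an HMM for unknown words). Both dictionaries below are tiny, made-up examples, so the first output will not match jieba's default output above.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or chunk in dictionary:
                tokens.append(chunk)
                i += size
                break
    return tokens


# Hypothetical dictionaries, for illustration only.
SMALL_DICT = {"衛生", "買"}
BIG_DICT = SMALL_DICT | {"舒潔", "衛生紙", "買一送一"}

small_result = forward_max_match("舒潔衛生紙買一送一", SMALL_DICT)
big_result = forward_max_match("舒潔衛生紙買一送一", BIG_DICT)
print(", ".join(small_result))
print(", ".join(big_result))
```

With the small dictionary the brand name and the promotion phrase fall apart into single characters. Once those entries are present, the same greedy procedure recovers them.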
Segmentation
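Part-of-speech tagging: verb? noun? adjective? adverb?
#encoding=UTF-8
import jieba.posseg as pseg

if __name__ == '__main__':
    # "Today let us test Chinese word segmentation"
    input_str = u'今天讓我們來測試中文斷詞'
    seg_list = pseg.cut(input_str)
    for seg, flag in seg_list:  # flag is the part-of-speech tag
        print(u'{}:{}'.format(seg, flag))
Output:
今天:t 讓:v 我們:r 來:v 測試:vn 中文:nz 斷詞:n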
Segmentation
Keyword extraction
#encoding=UTF-8
import jieba
import jieba.analyse

if __name__ == '__main__':
    # "My hometown is in Taiwan, I am Taiwanese"
    input_str = u'我的故鄉在台灣, I am Taiwanese'
    jieba.set_dictionary('./data/dict.txt.big')
    seg_list = jieba.analyse.extract_tags(input_str, topK=3)  # without custom stop words
    print(', '.join(seg_list))
    jieba.analyse.set_stop_words('./data/stop_words.txt')
    seg_list = jieba.analyse.extract_tags(input_str, topK=3)  # with custom stop words
    print(', '.join(seg_list))
Output:
台灣, am, 故鄉
台灣, 故鄉, Taiwanese
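The effect of the stop-word list can be sketched with a small TF-IDF ranking in plain Python. This is an assumption-laden toy, not jieba's code: the function name top_keywords is hypothetical, the tokens are pre-segmented by hand, and the IDF weights are made up so that the ranking mirrors the output above.

```python
from collections import Counter


def top_keywords(tokens, idf, top_k=3, stop_words=frozenset(), default_idf=1.0):
    """Rank tokens by term frequency times IDF, skipping stop words and single characters."""
    kept = [t for t in tokens if len(t) >= 2 and t not in stop_words]
    counts = Counter(kept)
    total = len(kept)
    scores = {w: (c / total) * idf.get(w, default_idf) for w, c in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hand-segmented tokens and made-up IDF weights (illustration only).
TOKENS = ["我", "的", "故鄉", "在", "台灣", "I", "am", "Taiwanese"]
IDF = {"台灣": 9.0, "am": 8.0, "故鄉": 7.0, "Taiwanese": 6.0}

without_stop = top_keywords(TOKENS, IDF)
with_stop = top_keywords(TOKENS, IDF, stop_words={"am"})
print(", ".join(without_stop))
print(", ".join(with_stop))
```

Removing "am" before scoring frees a slot in the top three, which is why "Taiwanese" appears only in the second result.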
Finding Similarity
1. How do we do that? Word2Vec is the superstar!
Word2Vec
Transform each word into a vector; the distance between vectors indicates the
degree of similarity
(首爾 = Seoul, 日本 = Japan, 東京 = Tokyo, 南韓 = South Korea)
|vector("首爾") - vector("日本")| > |vector("東京") - vector("日本")|
vector("東京") - vector("日本") + vector("南韓") ≈ vector("首爾")
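A similarity comparison and a capital/country analogy of this kind can be checked on toy vectors. The sketch below uses hand-made three-dimensional vectors (an assumption for illustration; real word2vec vectors are learned and have around a hundred dimensions or more) and cosine similarity as the closeness measure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Made-up vectors; dimensions loosely mean (Japan-ness, Korea-ness, capital-ness).
VECTORS = {
    "東京": [1.0, 0.0, 1.0],   # Tokyo
    "日本": [1.0, 0.0, 0.0],   # Japan
    "首爾": [0.0, 1.0, 1.0],   # Seoul
    "南韓": [0.0, 1.0, 0.0],   # South Korea
    "蘋果": [0.5, 0.5, -1.0],  # apple (distractor)
}


def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding a, b, c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


tokyo_japan = cosine(VECTORS["東京"], VECTORS["日本"])
seoul_japan = cosine(VECTORS["首爾"], VECTORS["日本"])
answer = analogy("東京", "日本", "南韓", VECTORS)
print(tokyo_japan > seoul_japan)  # Tokyo is closer to Japan than Seoul is
print(answer)                     # word nearest to Tokyo - Japan + South Korea
```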
Word2Vec
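In word2vec, the target word is asked to predict its surrounding context
在日本,[ 青森 的 "蘋果" 又 甜 ]又好吃 (In Japan, Aomori's "apples" are sweet and tasty)
今年,新版的[ Macbook 是 "蘋果" 發表 的 ]重點之一 (This year, the new Macbook is one of the highlights announced by "Apple")
"青森" and "Macbook" both have high similarity with "蘋果"
Through training on the preceding window, "青森" and "日本" also have
high similarity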
Word2Vec
word2vec uses a skip-gram neural network to predict the neighboring
context words
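The training examples that a skip-gram model sees can be sketched as (target, context) pairs taken from a sliding window. The function below is a minimal illustration, assuming a fixed window of two words on each side as in the bracketed example; it only builds the pairs and does not train a network.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word and its neighbors within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["青森", "的", "蘋果", "又", "甜"]
pairs = skipgram_pairs(sentence, window=2)
apple_contexts = [context for target, context in pairs if target == "蘋果"]
print(apple_contexts)
```

For the target "蘋果" the contexts are exactly the four bracketed neighbors. Words that share many contexts are pushed toward similar vectors during training.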
Training Word2Vec model by gensim
The input words are already preprocessed and separated by whitespace.
#encoding=UTF-8
import multiprocessing
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    inp = sys.argv[1]  # path to the segmented corpus
    model = Word2Vec(LineSentence(inp),
                     size=100,      # vector dimensionality (named vector_size in gensim 4.x)
                     window=10,     # maximum distance between target and context word
                     min_count=10,  # ignore words seen fewer than 10 times
                     workers=multiprocessing.cpu_count())
This didn't work for me: gensim's word2vec ran out of memory
Move to Spark MLlib
1. Spark offers over 80 operators that make it easy to build
parallel applications
2. Databricks used Spark to set a world record in the
2014 100 TB sort benchmark competition
3. MLlib is Spark's machine learning library.
Spark cluster overview
1. Spark uses a master-slave architecture, similar to YARN
2. The cluster manager is the master; it handles resource
management and slave health monitoring.
3. When you launch an application,
the master assigns a slave to be the driver.
The driver requests resources from the master,
executes the main function, and assigns tasks to slaves
Spark cluster deployment
1. Use the Linode API to create and boot new instances rapidly
2. Use a standalone Spark cluster;
Spark can also be deployed on a Mesos or YARN cluster
3. Install Java and Scala, copy the pre-built Spark package, and finally launch
the slave executors
4. Use Ansible to deploy the Spark executors, and use LZ4 to speed
up decompression of the pre-built Spark package
Training Word2Vec model by Spark cluster
An RDD is the basic abstraction in Spark.
It represents an immutable, partitioned collection of elements
that can be operated on in parallel
val input:RDD[String] = sc.textFile(inp, 5).cache()
val token:RDD[Seq[String]] = input.map(article => tokenize(article))
val word2vec = new Word2Vec()
word2vec.setNumPartitions(5)
val model = word2vec.fit(token)
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs://....")
Querying Word2Vec model by Spark cluster
val model = sc.objectFile[Word2VecModel]("hdfs://....").first()
val synonyms = model.findSynonyms("熱火",10)
for ((synonym, cosineSim) <- synonyms) {
  println(synonym + ":" + cosineSim)
}
Load the model from HDFS.
Compared with model training, finding similar words requires
few resources
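Conceptually, a synonym query is a single pass over the vocabulary that ranks every word by cosine similarity to the query word, which is why it is cheap compared with training. The sketch below shows that idea in plain Python. It is a hypothetical stand-in, not Spark's implementation, and the two-dimensional vectors are made up for illustration.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def find_synonyms(word, vectors, k):
    """Return the k words most similar to word as (word, cosine similarity) pairs."""
    query = normalize(vectors[word])
    scored = (
        (sum(q * x for q, x in zip(query, normalize(vec))), other)
        for other, vec in vectors.items()
        if other != word  # never return the query word itself
    )
    return [(other, sim) for sim, other in heapq.nlargest(k, scored)]


# Made-up vectors: three basketball teams and one unrelated word.
VECTORS = {
    "熱火": [0.9, 0.1],  # Heat
    "湖人": [0.8, 0.2],  # Lakers
    "公牛": [0.7, 0.4],  # Bulls
    "蘋果": [0.1, 0.9],  # apple
}

result = find_synonyms("熱火", VECTORS, 2)
for synonym, similarity in result:
    print(synonym + ":" + str(similarity))
```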
Query Word2Vec by Spark cluster
Example of "man"
Example of "luffy" (main character of the comic One Piece)
Example of "cell phone"
Thank you

Running Word2Vec with Chinese Wikipedia dump

  • 1. Running Word2Vec with Chinese Wikipedia dump
  • 2. Similarity 1. if two words have high similarity, it means they have strong relationship 2. use wikipedia to let machine has general sense about our world "魯夫" is main charactrer in "海賊王" "東京" is capital city in "日本"
  • 3. Related Application 1. voice-driven assistants (Siri, Google Now, Microsoft Cortana) 2. e-commerce recommandation (Alibaba, Rakuten) 3. question answering(IBM Waston) 4. others(Flipboard, SmartNews)
  • 5. Build you own smart AI
  • 6.
  • 8. Download Wikipedia 1. https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest- pages-articles.xml.bz2 2. it contains traditional chinese and simplified chinese articles 3. 1G file size, 230,000 articles, 150,000,000 words
  • 9. Preprocessing 1. use OpenCC to translate from simplified chinese to traditional chinese 2. support C、C++、Python、PHP、Java、Ruby、Node.js 3. compatible with Linux, Windows and Mac 4. “智能手机” -> “智慧手機”, “信息” -> “資訊” 5. you can play it on the website http://opencc.byvoid.com/ opencc -i zhwiki.txt -o twwiki.txt -c /usr/share/opencc/s2twp.json
  • 10. Preprocessing 1. use gensim to extract article from Wikipedia dump 2. 2G memory is required
  • 11. Preprocessing from gensim.corpora import WikiCorpus if __name__ == '__main__': inp, outp = sys.argv[1:3] output = open(outp,'w') wiki = WikiCorpus(inp, lemmatize=False, dictionary={}) for text in wiki.get_texts(): output.write(space.join(text) + "n") output.close() gensim provides iterator to extract sentences from compressed wiki dump
  • 12. Segmentation 1. english uses some notation(whitespace, dot, etc) to separate words, but not all language follow this practice 2. "下雨天/留客天/留我/不留", "下雨/天留客/天留/我不留" 3. new word keep to be generated(such as "小確幸", "物聯網")
  • 13. Segmentation Jieba supports full and search mode #encoding=UTF-8 import jieba if __name__ == '__main__': input_str = u'今天讓我們來測試中文斷詞' seg_list = jieba.cut(input_str, cut_all=True) # full mode print(', '.join(seg_list)) seg_list = jieba.cut(input_str, cut_all=False) # search mode print(', '.join(seg_list)) 今天, 讓, 我, 們, 來, 測, 試, 中文, 斷, 詞 今天, 讓, 我們, 來, 測試, 中文, 斷詞
  • 14. Segmentation
    Sometimes the result is a little bit funny.
    #encoding=UTF-8
    import jieba
    if __name__ == '__main__':
        input_str = u'張無忌來大都找我吧!哈哈哈哈哈哈'
        seg_list = jieba.cut(input_str, cut_all=False)
        print(', '.join(seg_list))
    Output:
    張無忌, 來, 大都, 找, 我, 吧, !, 哈哈哈, 哈哈哈
  • 15. Segmentation
    Good dictionary, good result.
    #encoding=UTF-8
    import jieba
    if __name__ == '__main__':
        input_str = u'舒潔衛生紙買一送一'
        seg_list = jieba.cut(input_str, cut_all=False)
        print(', '.join(seg_list))
        jieba.set_dictionary('./data/dict.txt.big')
        seg_list = jieba.cut(input_str, cut_all=False)
        print(', '.join(seg_list))
    Output:
    舒潔衛, 生紙, 買, 一送, 一
    舒潔, 衛生紙, 買一送一
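Why the dictionary matters so much can be seen with a much simpler segmenter than Jieba's. The sketch below uses greedy forward maximum matching, which is a stand-in technique and not Jieba's actual algorithm. Both dictionaries are hypothetical toys, not the contents of Jieba's dictionary files.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    words = []
    start = 0
    while start < len(text):
        for size in range(min(max_word_len, len(text) - start), 0, -1):
            candidate = text[start:start + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                start += size
                break
    return words


# Both dictionaries are made up for this example.
SMALL_DICT = {'衛生', '買'}
BIG_DICT = SMALL_DICT | {'舒潔', '衛生紙', '買一送一'}

sentence = '舒潔衛生紙買一送一'
small_result = forward_max_match(sentence, SMALL_DICT)
big_result = forward_max_match(sentence, BIG_DICT)
print(', '.join(small_result))
print(', '.join(big_result))
```

Once the brand name and the set phrase are in the dictionary, the same greedy rule yields the three-word segmentation; without them the sentence falls apart into single characters.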
  • 16. Segmentation
    Part-of-speech tagging: verb? noun? adjective? adverb?
    #encoding=UTF-8
    import jieba.posseg as pseg
    if __name__ == '__main__':
        input_str = u'今天讓我們來測試中文斷詞'
        seg_list = pseg.cut(input_str)
        for seg, flag in seg_list:
            print(u'{}:{}'.format(seg, flag))
    Output:
    今天:t
    讓:v
    我們:r
    來:v
    測試:vn
    中文:nz
    斷詞:n
  • 17. Segmentation
    Keyword extraction
    #encoding=UTF-8
    import jieba
    import jieba.analyse
    if __name__ == '__main__':
        input_str = u'我的故鄉在台灣, I am Taiwanese'
        jieba.set_dictionary('./data/dict.txt.big')
        seg_list = jieba.analyse.extract_tags(input_str, topK=3)
        print(', '.join(seg_list))
        jieba.analyse.set_stop_words('./data/stop_words.txt')
        seg_list = jieba.analyse.extract_tags(input_str, topK=3)
        print(', '.join(seg_list))
    Output:
    台灣, am, 故鄉
    台灣, 故鄉, Taiwanese
  • 18. Finding Similarity
    1. How do we do that? Word2Vec is the superstar!
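Jieba's extract_tags ranks words by TF-IDF, and the stop-word list removes words before ranking. A minimal pure-Python sketch of that idea is shown below. The document frequencies, corpus size and stop-word list are invented for illustration, and the smoothing formula is one common choice rather than Jieba's exact weighting, so the ranking differs from the slide's output.

```python
import math
from collections import Counter


def extract_keywords(tokens, document_frequency, total_documents, stop_words, top_k=3):
    """Rank the tokens of one document by TF-IDF, skipping stop words."""
    kept = [token for token in tokens if token not in stop_words]
    counts = Counter(kept)
    scores = {}
    for token, count in counts.items():
        tf = count / len(kept)
        # Add-one smoothing so words never seen in the corpus do not divide by zero.
        idf = math.log((1 + total_documents) / (1 + document_frequency.get(token, 0)))
        scores[token] = tf * idf
    ranked = sorted(scores, key=lambda token: (-scores[token], token))
    return ranked[:top_k]


# Already-segmented input and made-up corpus statistics (assumptions for the demo).
tokens = ['我', '的', '故鄉', '在', '台灣', 'I', 'am', 'Taiwanese']
document_frequency = {'我': 900, '的': 990, '在': 950, '故鄉': 40, '台灣': 30, 'I': 600}
total_documents = 1000

without_stop_words = extract_keywords(tokens, document_frequency, total_documents, set())
with_stop_words = extract_keywords(
    tokens, document_frequency, total_documents, {'我', '的', '在', 'I', 'am'})
print(', '.join(without_stop_words))
print(', '.join(with_stop_words))
```

In this toy setup "am" scores high only because it is missing from the corpus statistics; adding it to the stop-word list lets "故鄉" take its place, which mirrors the effect shown on the slide.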
  • 19. Word2Vec
    Transforms each word into a vector; the distance between two vectors indicates how similar the words are.
    ‖vector("首爾") - vector("日本")‖ > ‖vector("東京") - vector("日本")‖
    vector("日本") - vector("東京") + vector("首爾") ≈ vector("南韓")
    (首爾: Seoul, 日本: Japan, 東京: Tokyo, 南韓: South Korea)
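The distance comparison and the country-capital offset can be checked by hand on toy vectors. In the sketch below the 2-D embeddings are invented so that the arithmetic works out exactly; real word2vec vectors have around a hundred dimensions and only satisfy such relations approximately. The offset is written as country minus capital, so that adding "首爾" lands on "南韓".

```python
import math

# Hypothetical 2-D embeddings: axis 0 separates the countries, axis 1 marks "is a capital".
VECTORS = {
    '日本': (1.0, 0.0),
    '東京': (1.0, 1.0),
    '南韓': (4.0, 0.0),
    '首爾': (4.0, 1.0),
    '蘋果': (0.0, 5.0),
}


def distance(a, b):
    """Euclidean distance between the vectors of two words."""
    return math.dist(VECTORS[a], VECTORS[b])


def analogy(positive, negative):
    """Return the word nearest to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(VECTORS.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, VECTORS[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, VECTORS[word])]
    candidates = [w for w in VECTORS if w not in positive and w not in negative]
    return min(candidates, key=lambda w: math.dist(target, VECTORS[w]))


print(distance('首爾', '日本') > distance('東京', '日本'))
answer = analogy(positive=['日本', '首爾'], negative=['東京'])
print(answer)
```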
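  • 20. Word2Vec
    Given a target word, word2vec is trained to predict its surrounding context.
    在日本,[ 青森 的 "蘋果" 又 甜 ]又好吃 (In Japan, Aomori's apples are sweet and tasty)
    今年,新版的[ Macbook 是 "蘋果" 發表 的 ]重點之一 (This year, the new Macbook is one of the highlights of Apple's announcements)
    "青森" (Aomori) and "Macbook" both have high similarity with "蘋果" (apple / Apple).
    Trained on the preceding window, "青森" and "日本" (Japan) also have high similarity.
  • 21. Word2Vec
    word2vec uses a skip-gram neural network to predict the neighboring context.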
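The training data that a skip-gram model sees can be sketched without any neural network: every word is paired with each neighbor inside a fixed window. The function name and the window size below are illustrative choices; real implementations add refinements such as randomly shrinking the window.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every word within `window` of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# The bracketed window from the earlier slide, already segmented.
tokens = ['青森', '的', '蘋果', '又', '甜']
pairs = skipgram_pairs(tokens, window=2)
apple_context = [context for target, context in pairs if target == '蘋果']
print(apple_context)
```

Each pair becomes one training example: the network receives the target word and is asked to assign high probability to the context word.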
  • 22. Training Word2Vec model by gensim
    The words are already preprocessed and separated by whitespace.
    #encoding=UTF-8
    import multiprocessing
    import sys
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence
    if __name__ == '__main__':
        inp = sys.argv[1]
        # gensim 4.x renamed the `size` argument to `vector_size`
        model = Word2Vec(LineSentence(inp), size=100, window=10, min_count=10,
                         workers=multiprocessing.cpu_count())
    This did not work for me: gensim's word2vec ran out of memory.
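The slide does not say why memory ran out, so the following is only a back-of-the-envelope way to reason about it. Model size grows with vocabulary size times vector size, and min_count controls how many words are kept. The sketch assumes 4-byte floats and three weight matrices of that shape; both figures are assumptions to check against your gensim version, and the corpus is a toy.

```python
from collections import Counter


def vocabulary_size(sentences, min_count):
    """Count the distinct words that appear at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return sum(1 for count in counts.values() if count >= min_count)


def estimate_matrix_bytes(vocab_size, vector_size, matrices=3, bytes_per_float=4):
    """Rough size of the weight matrices; `matrices=3` is an assumption, not a measurement."""
    return vocab_size * vector_size * bytes_per_float * matrices


# Hypothetical toy corpus; a real run would stream the segmented wiki text instead.
corpus = [['東京', '是', '日本', '的', '首都'], ['日本', '的', '蘋果', '又', '甜']]
kept = vocabulary_size(corpus, min_count=2)
print(kept, estimate_matrix_bytes(kept, vector_size=100))

# Illustrative figure only: one million kept words at 100 dimensions.
print(estimate_matrix_bytes(1_000_000, vector_size=100) / 1e9, 'GB')
```

Under these assumptions, raising min_count or lowering the vector size are the two direct levers on the weight matrices; the vocabulary dictionary itself adds further overhead that this estimate ignores.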
  • 23. Move to Spark MLlib
    1. Spark offers over 80 operators that make it easy to build parallel applications.
    2. Databricks used Spark to set a world record in the 2014 sort benchmark competition (Daytona GraySort, 100 TB).
    3. MLlib is Spark's machine learning library.
  • 24. Spark cluster overview
    1. Spark has a master-slave architecture, similar to YARN.
    2. The cluster manager is the master; it handles resource management and slave health management.
    3. When you launch an application, the master assigns a slave to be the driver. The driver requests resources from the master, executes the main function and assigns tasks to the slaves.
  • 25. Spark cluster deployment
    1. Use the Linode API to create and boot new instances rapidly.
    2. Use a standalone Spark cluster; it can also be deployed on a Mesos or YARN cluster.
    3. Install Java and Scala, put the pre-built Spark package in place, and finally launch the slave executors.
    4. Use Ansible to deploy the Spark executors, and use LZ4 to speed up decompression of the pre-built Spark package.
  • 26. Training Word2Vec model by Spark cluster
    RDD is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.
    import org.apache.spark.mllib.feature.Word2Vec
    import org.apache.spark.rdd.RDD
    val input: RDD[String] = sc.textFile(inp, 5).cache()
    // tokenize is not defined on the slide; presumably a helper that splits an article into words
    val token: RDD[Seq[String]] = input.map(article => tokenize(article))
    val word2vec = new Word2Vec()
    word2vec.setNumPartitions(5)
    val model = word2vec.fit(token)
    sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs://....")
  • 27. Querying Word2Vec model by Spark cluster
    import org.apache.spark.mllib.feature.Word2VecModel
    val model = sc.objectFile[Word2VecModel]("hdfs://....").first()
    val synonyms = model.findSynonyms("熱火", 10)
    for ((synonym, cosineSim) <- synonyms) {
      println(synonym + ":" + cosineSim)
    }
    Load the model from HDFS. Compared with model training, finding similar words needs far fewer resources.
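What a findSynonyms-style query computes can be sketched in plain Python: score every other word by cosine similarity to the query word and keep the top few. The vectors below are invented 2-D stand-ins for a trained model, and the function is an illustration of the idea rather than MLlib's implementation.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def find_synonyms(vectors, word, num):
    """Return the `num` words whose vectors are most similar to the vector of `word`."""
    query = vectors[word]
    scored = ((cosine(query, vec), other) for other, vec in vectors.items() if other != word)
    return [(other, sim) for sim, other in heapq.nlargest(num, scored)]


# Hypothetical toy embeddings, not output from a trained model.
VECTORS = {
    '魯夫': (1.0, 0.2),
    '海賊王': (0.8, 0.1),
    '東京': (0.1, 1.0),
    '日本': (0.3, 0.9),
}

synonyms = find_synonyms(VECTORS, '魯夫', 2)
for synonym, cosine_sim in synonyms:
    print(synonym + ':' + str(round(cosine_sim, 3)))
```

The query is a single pass over the vocabulary with no training involved, which is consistent with the slide's remark that lookup is cheap compared with training.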
  • 28. Query Word2Vec by Spark cluster
  • 31. Example of "Luffy" (the main character of the comic One Piece)