17. Find Word / Doc Similarity with
Deep Learning
Using word2vec and Gensim (Python)
18. Goal (or Problem to Solve)
• Problem: Tech Support (TS) engineers want to “precisely” categorize
support cases. Today the task is performed manually by TS engineers.
• Goal: Automatically categorize support cases.
• What I have:
• 156 classified cases (with “so-called” correct issue categories)
• Support cases in database
• Challenges:
• Based on the data currently available, supervised classification algorithms can't be
applied.
• Clustering may not fully achieve the goal.
• What about Deep Learning?
19. Gensim (word2vec implementation in Python)
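from os import listdir
from os.path import join

import gensim

# TaggedDocument replaces LabeledSentence, which is gone from recent Gensim releases.
TaggedDocument = gensim.models.doc2vec.TaggedDocument

corpus_dir = "../corpora/2016/"
docLabels = [f for f in listdir(corpus_dir) if f.endswith(".txt")]

# Read each case into memory so the corpus can be iterated more than once
# (an open file handle returns nothing on a second read()).
data = []
for doc in docLabels:
    with open(join(corpus_dir, doc), "r") as f:
        data.append(f.read())

class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            # One document per support case, tagged with its file name.
            yield TaggedDocument(words=doc.split(),
                                 tags=[self.labels_list[idx]])
20. Gensim (Cont’d)
it = LabeledLineSentence(data, docLabels)
model = gensim.models.Doc2Vec(alpha=0.025,
                              min_alpha=0.025)
model.build_vocab(it)
# Ten passes, lowering the (fixed per pass) learning rate by hand after each one.
for epoch in range(10):
    model.train(it, total_examples=model.corpus_count, epochs=1)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
# find most similar support case
# (tags are the file names, so the ".txt" suffix is part of the tag;
#  model.dv is the Gensim 4.x name, older releases call it model.docvecs)
print(model.dv.most_similar("00111105.txt"))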
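The similarity lookup above ranks documents by the cosine similarity of their learned vectors. The sketch below illustrates that ranking step in plain Python. The function names, the tags (`case_a` ... `case_d`), and the 3-dimensional vectors are all hypothetical, made up for illustration; real Doc2Vec vectors are learned by the model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_tag, vectors, topn=3):
    """Rank every other document by cosine similarity to the query document."""
    query = vectors[query_tag]
    scored = [
        (tag, cosine_similarity(query, vec))
        for tag, vec in vectors.items()
        if tag != query_tag
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors, NOT output of a trained model.
toy_vectors = {
    "case_a": [1.0, 0.0, 0.0],
    "case_b": [0.9, 0.1, 0.0],
    "case_c": [0.0, 1.0, 0.0],
    "case_d": [-1.0, 0.0, 0.0],
}

ranking = most_similar("case_a", toy_vectors)
for tag, score in ranking:
    print(tag, round(score, 3))
```

A score near 1 means the two cases point in almost the same direction in vector space, a score near 0 means they are unrelated, and a negative score means they point in opposite directions.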
21. Rumor Has It
• With Deep Learning you don't need to do feature selection, because deep learning
decides it for you automatically
• From Wikipedia (https://en.wikipedia.org/wiki/Deep_learning):
• “One of the promises of deep learning is replacing handcrafted features with
efficient algorithms for unsupervised or semi-supervised feature learning and
hierarchical feature extraction.”
• Is it really that magical?
22. Feature selection for Iris Dataset as Example
• Iris dataset attributes
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
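To make "feature selection" concrete for these four attributes, here is a minimal sketch that scores each attribute by how well it separates the three classes. The scoring rule (between-class sum of squares divided by within-class sum of squares) is an assumption chosen for illustration, not necessarily the method used in this deck. The nine rows are the first three samples of each class from the UCI Iris data, so the resulting ranking reflects only this tiny sample.

```python
FEATURES = ["sepal length", "sepal width", "petal length", "petal width"]

# First three samples per class from the UCI Iris dataset (values in cm).
SAMPLES = {
    "Iris Setosa": [
        (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (4.7, 3.2, 1.3, 0.2),
    ],
    "Iris Versicolour": [
        (7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5), (6.9, 3.1, 4.9, 1.5),
    ],
    "Iris Virginica": [
        (6.3, 3.3, 6.0, 2.5), (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1),
    ],
}


def mean(values):
    return sum(values) / len(values)


def separation_score(feature_index, samples_by_class):
    """Between-class over within-class sum of squares for one feature."""
    columns = {
        label: [row[feature_index] for row in rows]
        for label, rows in samples_by_class.items()
    }
    all_values = [v for values in columns.values() for v in values]
    overall_mean = mean(all_values)

    between = 0.0
    within = 0.0
    for values in columns.values():
        class_mean = mean(values)
        between += len(values) * (class_mean - overall_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in values)

    if within == 0.0:
        return float("inf")  # classes are perfectly tight on this feature
    return between / within


scores = {
    name: separation_score(i, SAMPLES) for i, name in enumerate(FEATURES)
}
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(name, round(scores[name], 2))
```

On this sample the two petal measurements score well above the two sepal measurements, which is the kind of signal a manual feature-selection step would act on.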