6. Approach 1 - TFIDF
• Preprocessing
• Lower case, lemmatize, stop words, punctuation, tokenization, tag and filter by part-of-speech tags
• Keyword Extraction models
– TF-IDF
• TF-IDF(w, d, n, N) = TF(w, d) * IDF(n, N)
– TF(w, d) = # times word w occurred in doc d
– IDF(n, N) = log(N / n), where n = # docs containing word w and N = total # docs
Word      | TF in doc1 | # docs word appears in | TF-IDF
car       | 27         | 3                      | 0
auto      | 3          | 2                      | 1.216
insurance | 0          | 2                      | 0
best      | 14         | 2                      | 5.676
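The table values are consistent with a natural-log IDF over N = 3 documents; a minimal sketch (the function name is illustrative, not from the slides):

```python
import math

def tfidf(tf, df, n_docs):
    """TF-IDF with a natural-log IDF: tf * ln(N / n)."""
    if tf == 0:
        return 0.0
    return tf * math.log(n_docs / df)

# Reproduce the table above (N = 3 documents):
scores = {
    "car": tfidf(27, 3, 3),       # ln(3/3) = 0   -> 0.0
    "auto": tfidf(3, 2, 3),       # 3 * ln(3/2)   -> ~1.216
    "insurance": tfidf(0, 2, 3),  # tf = 0        -> 0.0
    "best": tfidf(14, 2, 3),      # 14 * ln(3/2)  -> ~5.676
}
```

Note that "car" scores 0 despite the highest term frequency, because it appears in every document.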
7. Approach 2 - TextRank
• Preprocessing
• Lower case, lemmatize, stop words, punctuation, tokenization, tag and filter by part-of-speech tags
• Identify structurally important keywords
• Iteratively calculate:
S(V_i) = (1 - d) + d * Σ_{V_j ∈ nbr(V_i)} S(V_j) / degree(V_j)
where d is the damping factor, usually set to 0.85
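A minimal sketch of this iteration on a toy undirected co-occurrence graph (the graph and function name are illustrative, not from the slides):

```python
def textrank_scores(nbr, d=0.85, iters=100):
    """Iterate S(Vi) = (1 - d) + d * sum over neighbors Vj of S(Vj)/degree(Vj)."""
    scores = {v: 1.0 for v in nbr}
    for _ in range(iters):
        scores = {
            v: (1 - d) + d * sum(scores[u] / len(nbr[u]) for u in nbr[v])
            for v in nbr
        }
    return scores

# Toy graph: "car" co-occurs with both other words, so it is the
# structurally most important vertex and ends up with the highest score.
graph = {
    "car": ["auto", "insurance"],
    "auto": ["car"],
    "insurance": ["car"],
}
ranks = textrank_scores(graph)
```

With d = 0.85 the update is a contraction, so the scores converge to a fixed point regardless of the initial values.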
14. Approach 4 - TextRank + Word2vec

TextRank result:
Word      | TextRank score
tractor   | 0.015847
john      | 0.013281
sale      | 0.012494
standard  | 0.012474
equipment | 0.010799
power     | 0.009747
messick   | 0.008162
new       | 0.008151
work      | 0.007907
series    | 0.007707
mower     | 0.006099
utility   | 0.006035
compact   | 0.005751

Word2vec similarity to "tractor":
Word      | Similarity
mower     | 0.8502
excavator | 0.7708
shredder  | 0.7451
tillage   | 0.7341
harvest   | 0.7154
sprayer   | 0.7101

Combined:
Word      | New score
tractor   | 0.015847
mower     | 0.015847 * 0.8502 = 0.013473
john      | 0.013281
sale      | 0.012494
standard  | 0.012474
excavator | 0.015847 * 0.7708 = 0.012215
shredder  | 0.015847 * 0.7451 = 0.011808
tillage   | 0.015847 * 0.7341 = 0.011633
harvest   | 0.015847 * 0.7154 = 0.011337
sprayer   | 0.015847 * 0.7101 = 0.011253
equipment | 0.010799
power     | 0.009747
messick   | 0.008162
new       | 0.008151
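A sketch of the combination step, assuming (as the arithmetic above suggests) that each word similar to the top TextRank keyword "tractor" is scored as S(tractor) * similarity; only a few rows are reproduced, and the variable names are illustrative:

```python
# TextRank scores and word2vec similarities taken from the tables above.
textrank = {"tractor": 0.015847, "john": 0.013281, "sale": 0.012494}
similar_to_tractor = {"mower": 0.8502, "excavator": 0.7708}

# Each similar word inherits the top keyword's score scaled by similarity.
new_scores = dict(textrank)
top_score = textrank["tractor"]
for word, sim in similar_to_tractor.items():
    new_scores[word] = top_score * sim

ranked = sorted(new_scores, key=new_scores.get, reverse=True)
```

This lets semantically related words ("mower", "excavator") outrank structurally present but less relevant words ("john", "sale").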
15. Google’s Pre-trained Word2vec
Campaign                                | % Words in Pre-trained Model Vocab. | % Keywords in Pre-trained Model Vocab.
Geico                                   | 0.929985                            | 0.88888
Taylor Messick (Agricultural Equipment) | 0.929784                            | 0.41176
Trane (AC)                              | 0.922018                            | 0.71428
16. Model Testing
1. Generate keywords with each of the 4 models
2. Feed them into Lucene to find URLs
3. Track the audience who visited those URLs
4. Compare the audience we find to the audience the pixels find
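Step 4 can be sketched as a simple set comparison; the user IDs and the choice of precision/recall as the comparison metric are hypothetical, not from the slides:

```python
# Hypothetical audiences: users reached via the keyword -> Lucene -> URL
# pipeline vs. users observed directly by the tracking pixels.
keyword_audience = {"u1", "u2", "u3", "u4"}
pixel_audience = {"u2", "u3", "u4", "u5"}

overlap = keyword_audience & pixel_audience
precision = len(overlap) / len(keyword_audience)  # our audience that pixels confirm
recall = len(overlap) / len(pixel_audience)       # pixel audience we recovered
```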
17. Results (Dell) - Keyword
TFIDF                 | TextRank        | Word2vec_Textrank    | TextRank_Word2vec
office                | dell            | outlet               | dell
dellcom               | support         | collaboration        | acquire
view                  | service         | acquire              | laptop
electronics           | product         | work                 | desktop
customer              | price           | purchase             | software
dell                  | use             | spare                | rebate
representative        | software        | poster               | welding
dellcomreturnspolicy  | customer        | transformation       | windows
dells                 | system          | apg                  | corporations
information practices | new dell        | please               | dell software
prosupport dell       | dell inc        | poster               | laptop desktop
products view         | dell outlet     | apg transformation   | dell new
services support dell | dell today      | purchase acquire     | dell tablet
dell sales            | dell team       | spare transformation | dell inc
18. Results (Toyota) - Keyword
TFIDF                | TextRank          | Word2vec_Textrank | TextRank_Word2vec
highlander           | toyota            | generate          | toyota
kbbcom               | information       | acquire           | preowned
edmundscom           | site              | misuse            | certified
certify              | vehicle           | tale              | highlander
information          | use               | govern            | rav
certification        | program           | tradein           | yaris
site                 | email             | fourwheel         | avalon
program              | service           | generate tale     | corolla
assistance           | sale              | rubbed bologna    | sequoia
violated please      | toyota site       | identify          | tundra
hybrid               | highlander toyota | vehicle wheel     | camry
car certification    | toyota dealer     | rubbed tale       | venza
personal information | new toyota        | help              | toyota vehicle
cruiser preowned     | toyota certified  | new avalon        | preowned
22. Conclusion
• TextRank and TextRank_Word2vec consistently perform better than TFIDF
• TextRank doesn't require extra storage for saving a model
• All 3 models need O(n) computational time
Gensim's tfidf model takes care of normalization by document length.
Sklearn's tfidf model takes care of normalization and a pseudocount:
# log+1 instead of log makes sure terms with zero idf don't get suppressed entirely.
# idf = np.log(float(n_samples) / df) + 1.0
Sklearn uses the natural log, while gensim's tfidf uses log2.
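The difference between the two defaults, in numbers (sklearn is shown with smooth_idf=False, which matches the quoted comment; its actual default, smooth_idf=True, adds 1 to both the document count and the document frequency before taking the log):

```python
import math

# IDF of a word that appears in 2 of 3 documents, under each convention.
n_docs, df = 3, 2
sklearn_idf = math.log(n_docs / df) + 1.0  # natural log, plus-one pseudocount
gensim_idf = math.log2(n_docs / df)        # log base 2, no pseudocount
```

The pseudocount means sklearn never assigns an IDF of 0, even to a word that appears in every document.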
250 words, 250 vertices.
Both need arbitrary parameters that are hard to determine, so the parameters have to be tuned.
K-means doesn't cluster well with the model we trained.
DBSCAN clusters better, but depending on the parameter settings it either throws out a lot of keywords as noise or lumps everything into one big cluster.
This still did not solve the generalization problem.
Identify keywords that are either not in the document, or structurally less important in the document but semantically close to the more important keywords.
Integrate the structural importance with the semantic importance.