6. Approach 1 - TFIDF
• Preprocessing
• Lower case, lemmatize, stop words, punctuation, tokenization, tag and filter by part-of-speech tags
• Keyword Extraction models
– TF-IDF
• TF-IDF(w, d, n, N) = TF(w, d) * IDF(n, N)
– TF(w, d) = # times word w occurred in doc d
– IDF(n, N) = log(N / n), where n = # docs containing word w and N = total # docs
Word      | TF in doc1 | # docs word appears in | TF-IDF
car       | 27         | 3                      | 0
auto      | 3          | 2                      | 1.216
insurance | 0          | 2                      | 0
best      | 14         | 2                      | 5.676
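The table values are consistent with a natural-log IDF over N = 3 documents; a minimal sketch (the function name is illustrative, not from the slides):

```python
import math

def tfidf(tf, df, n_docs):
    """TF-IDF with a natural-log IDF: tf * ln(N / n)."""
    if tf == 0:
        return 0.0
    return tf * math.log(n_docs / df)

# Reproduce the table above (N = 3 documents):
scores = {
    "car": tfidf(27, 3, 3),       # ln(3/3) = 0   -> 0.0
    "auto": tfidf(3, 2, 3),       # 3 * ln(3/2)   -> ~1.216
    "insurance": tfidf(0, 2, 3),  # tf = 0        -> 0.0
    "best": tfidf(14, 2, 3),      # 14 * ln(3/2)  -> ~5.676
}
```

Note that "car" scores 0 despite the highest term frequency, because it appears in every document.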
7. Approach 2 - TextRank
• Preprocessing
• Lower case, lemmatize, stop words, punctuation, tokenization, tag and filter by part-of-speech tags
• Identify structurally important keywords
• Iteratively calculate:
S(V_i) = (1 - d) + d * Σ_{V_j ∈ nbr(V_i)} S(V_j) / degree(V_j)
where d is the damping factor, usually set to 0.85
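A minimal sketch of this iteration on a toy undirected co-occurrence graph (the graph and function name are illustrative, not from the slides):

```python
def textrank_scores(nbr, d=0.85, iters=100):
    """Iterate S(Vi) = (1 - d) + d * sum over neighbors Vj of S(Vj)/degree(Vj)."""
    scores = {v: 1.0 for v in nbr}
    for _ in range(iters):
        scores = {
            v: (1 - d) + d * sum(scores[u] / len(nbr[u]) for u in nbr[v])
            for v in nbr
        }
    return scores

# Toy graph: "car" co-occurs with both other words, so it is the
# structurally most important vertex and ends up with the highest score.
graph = {
    "car": ["auto", "insurance"],
    "auto": ["car"],
    "insurance": ["car"],
}
ranks = textrank_scores(graph)
```

With d = 0.85 the update is a contraction, so the scores converge to a fixed point regardless of the initial values.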
14. Approach 4 - TextRank + Word2vec

TextRank result:
Word      | TextRank score
tractor   | 0.015847
john      | 0.013281
sale      | 0.012494
standard  | 0.012474
equipment | 0.010799
power     | 0.009747
messick   | 0.008162
new       | 0.008151
work      | 0.007907
series    | 0.007707
mower     | 0.006099
utility   | 0.006035
compact   | 0.005751

Word2vec similarity to "tractor":
Word      | Similarity
mower     | 0.8502
excavator | 0.7708
shredder  | 0.7451
tillage   | 0.7341
harvest   | 0.7154
sprayer   | 0.7101

Combined:
Word      | New score
tractor   | 0.015847
mower     | 0.015847 * 0.8502 = 0.013473
john      | 0.013281
sale      | 0.012494
standard  | 0.012474
excavator | 0.015847 * 0.7708 = 0.012215
shredder  | 0.015847 * 0.7451 = 0.011808
tillage   | 0.015847 * 0.7341 = 0.011633
harvest   | 0.015847 * 0.7154 = 0.011337
sprayer   | 0.015847 * 0.7101 = 0.011253
equipment | 0.010799
power     | 0.009747
messick   | 0.008162
new       | 0.008151
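A sketch of the combination step, assuming (as the arithmetic above suggests) that each word similar to the top TextRank keyword "tractor" is scored as S(tractor) * similarity; only a few rows are reproduced, and the variable names are illustrative:

```python
# TextRank scores and word2vec similarities taken from the tables above.
textrank = {"tractor": 0.015847, "john": 0.013281, "sale": 0.012494}
similar_to_tractor = {"mower": 0.8502, "excavator": 0.7708}

# Each similar word inherits the top keyword's score scaled by similarity.
new_scores = dict(textrank)
top_score = textrank["tractor"]
for word, sim in similar_to_tractor.items():
    new_scores[word] = top_score * sim

ranked = sorted(new_scores, key=new_scores.get, reverse=True)
```

This lets semantically related words ("mower", "excavator") outrank structurally present but less relevant words ("john", "sale").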
15. Google’s Pre-trained Word2vec
Campaign                                | % Words in Pre-trained Model Vocab. | % Keywords in Pre-trained Model Vocab.
Geico                                   | 0.929985                            | 0.88888
Taylor Messick (Agricultural Equipment) | 0.929784                            | 0.41176
Trane (AC)                              | 0.922018                            | 0.71428
16. Model Testing
1. Generate keywords with each of the 4 models
2. Feed them into Lucene to find URLs
3. Track the audience who visited those URLs
4. Compare the audience we find to the audience the pixels find
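Step 4 can be sketched as a simple set comparison; the user IDs and the choice of precision/recall as the comparison metric are hypothetical, not from the slides:

```python
# Hypothetical audiences: users reached via the keyword -> Lucene -> URL
# pipeline vs. users observed directly by the tracking pixels.
keyword_audience = {"u1", "u2", "u3", "u4"}
pixel_audience = {"u2", "u3", "u4", "u5"}

overlap = keyword_audience & pixel_audience
precision = len(overlap) / len(keyword_audience)  # our audience that pixels confirm
recall = len(overlap) / len(pixel_audience)       # pixel audience we recovered
```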
17. Results (Dell) - Keyword
TFIDF                 | TextRank        | Word2vec_Textrank    | TextRank_Word2vec
office                | dell            | outlet               | dell
dellcom               | support         | collaboration        | acquire
view                  | service         | acquire              | laptop
electronics           | product         | work                 | desktop
customer              | price           | purchase             | software
dell                  | use             | spare                | rebate
representative        | software        | poster               | welding
dellcomreturnspolicy  | customer        | transformation       | windows
dells                 | system          | apg                  | corporations
information practices | new dell        | please               | dell software
prosupport dell       | dell inc        | poster               | laptop desktop
products view         | dell outlet     | apg transformation   | dell new
services support dell | dell today      | purchase acquire     | dell tablet
dell sales            | dell team       | spare transformation | dell inc
18. Results (Toyota) - Keyword
TFIDF                | TextRank          | Word2vec_Textrank | TextRank_Word2vec
highlander           | toyota            | generate          | toyota
kbbcom               | information       | acquire           | preowned
edmundscom           | site              | misuse            | certified
certify              | vehicle           | tale              | highlander
information          | use               | govern            | rav
certification        | program           | tradein           | yaris
site                 | email             | fourwheel         | avalon
program              | service           | generate tale     | corolla
assistance           | sale              | rubbed bologna    | sequoia
violated please      | toyota site       | identify          | tundra
hybrid               | highlander toyota | vehicle wheel     | camry
car certification    | toyota dealer     | rubbed tale       | venza
personal information | new toyota        | help              | toyota vehicle
cruiser preowned     | toyota certified  | new avalon        | preowned
22. Conclusion
• TextRank and TextRank_Word2vec consistently perform better than TFIDF
• TextRank doesn't require extra storage for saving a model
• All 3 models need O(n) computational time
Gensim's tfidf model takes care of normalization by document length.
Sklearn's tfidf model takes care of normalization and a pseudocount:
# log+1 instead of log makes sure terms with zero idf don't get suppressed entirely.
# idf = np.log(float(n_samples) / df) + 1.0
Sklearn uses the natural log, while gensim's tfidf uses log2.
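The difference between the two defaults, in numbers (sklearn is shown with smooth_idf=False, which matches the quoted comment; its actual default, smooth_idf=True, adds 1 to both the document count and the document frequency before taking the log):

```python
import math

# IDF of a word that appears in 2 of 3 documents, under each convention.
n_docs, df = 3, 2
sklearn_idf = math.log(n_docs / df) + 1.0  # natural log, plus-one pseudocount
gensim_idf = math.log2(n_docs / df)        # log base 2, no pseudocount
```

The pseudocount means sklearn never assigns an IDF of 0, even to a word that appears in every document.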
250 words, 250 vertices.
Both need arbitrary parameters that are hard to determine, so the parameters have to be tuned.
K-means doesn't cluster well with the model we trained.
DBSCAN clusters better, but depending on the parameter settings it either throws out a lot of keywords as noise or lumps everything into one big cluster.
This still did not solve the generalization problem.
Identify keywords that are either not in the document, or structurally less important in the document but semantically close to the more important keywords.
Integrate the structural importance with the semantic importance.