Automatic Search Event
by
Automatic Keyword Extraction
Xiwei Yan
08-10-2016
Overview
Ads landing pages → HTML source code → Text → Keywords & Key Phrases → Similar Webpages → Audience
Motivation
• Automate the search events (free BA from
manually generating the keywords)
• Identify users for campaigns that don’t have
pixels
A First Glimpse at Result
Approach
• Preprocessing
• Keyword Extraction models
– TF-IDF
– TextRank
– Word2Vec + TextRank
– TextRank + Word2Vec
Approach 1 - TFIDF
• Preprocessing
• Lower case, lemmatize, stop words, punctuation, tokenization, tag and
filter by part-of-speech tags
• Keyword Extraction models
– TF-IDF
• TF-IDF(w, d, n, N) = TF(w, d) * IDF(n, N)
– TF(w, d) = # times word w occurred in doc d
– IDF(n, N) = log(N / n), where n = # docs the word w appears in and N = total # docs
Example (the values match N = 3 and the natural log):
Word        Term freq in doc1   Appears in # docs   TF-IDF
car         27                  3                   0
auto        3                   2                   1.216
Insurance   0                   2                   0
Best        14                  2                   5.676
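The table above can be checked with a few lines of Python. This is a minimal sketch, assuming the unsmoothed form TF * ln(N / n) with N = 3 documents; the function name tf_idf is illustrative and is not taken from the slides or from any library.

```python
import math

def tf_idf(term_freq, docs_with_term, total_docs):
    """Raw term frequency times the natural-log inverse document frequency."""
    if docs_with_term == 0:
        return 0.0  # a word that appears in no document gets no weight
    return term_freq * math.log(total_docs / docs_with_term)

# (term frequency in doc1, number of docs containing the word), from the slide
table = {"car": (27, 3), "auto": (3, 2), "Insurance": (0, 2), "Best": (14, 2)}
N = 3  # assumption: total number of documents, inferred from the slide's values

scores = {word: tf_idf(tf, n, N) for word, (tf, n) in table.items()}
for word, score in scores.items():
    print(f"{word:10s} {score:.4f}")
```

A word that occurs in every document (car) scores 0 however often it appears, which is why frequent but generic terms drop out of the keyword list.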
Approach 2 - TextRank
• Preprocessing
• Lower case, lemmatize, stop words, punctuation, tokenization, tag and filter by part-of-speech tags
• Identify structurally important keywords
• Iteratively calculate:
$S(V_i) = (1 - d) + d \sum_{j \in ngbr(V_i)} \frac{1}{degree(V_j)} S(V_j)$
d is the damping factor, usually set to 0.85
Approach 2 - TextRank
First iteration: every word starts with score 1.
Word        Degree   Initial score   After first iteration
geico       5        1               2.65
insurance   5        1               2.65
policy      4        1               2.19
auto        2        1               0.49
privacy     1        1               0.36
find        1        1               0.36
car         1        1               0.32
coverage    1        1               0.32
call        1        1               0.32
service     1        1               0.32
$S(V_i) = (1 - d) + d \sum_{j \in ngbr(V_i)} \frac{1}{degree(V_j)} S(V_j)$
Neighbors of geico: service, call, auto, insurance, policy
$Score(geico) = 0.15 + 0.85 * (\frac{1}{1} * 1 + \frac{1}{1} * 1 + \frac{1}{2} * 1 + \frac{1}{5} * 1 + \frac{1}{4} * 1) = 2.65$
Neighbors of policy: find, privacy, insurance, geico
$Score(policy) = 0.15 + 0.85 * (\frac{1}{1} * 1 + \frac{1}{1} * 1 + \frac{1}{5} * 1 + \frac{1}{5} * 1) = 2.19$
Neighbor of service: geico
$Score(service) = 0.15 + 0.85 * (\frac{1}{5} * 1) = 0.32$
d is the damping factor, usually set to 0.85
Approach 2 - TextRank
After 10 iterations:
Word        Score
geico       2.12
insurance   2.12
policy      1.77
auto        0.87
privacy     0.52
find        0.52
car         0.51
coverage    0.51
call        0.51
service     0.51
Applying the update once more gives nearly the same values, so the scores have converged:
$S(V_i) = (1 - d) + d \sum_{j \in ngbr(V_i)} \frac{1}{degree(V_j)} S(V_j)$
Neighbors of geico: service, call, auto, insurance, policy
$Score(geico) = 0.15 + 0.85 * (\frac{1}{1} * 0.51 + \frac{1}{1} * 0.51 + \frac{1}{2} * 0.87 + \frac{1}{5} * 2.12 + \frac{1}{4} * 1.77) = 2.12$
Neighbors of policy: find, privacy, insurance, geico
$Score(policy) = 0.15 + 0.85 * (\frac{1}{1} * 0.52 + \frac{1}{1} * 0.52 + \frac{1}{5} * 2.12 + \frac{1}{5} * 2.12) = 1.75$
Neighbor of service: geico
$Score(service) = 0.15 + 0.85 * (\frac{1}{5} * 2.12) = 0.51$
Converges really quickly (<= 20 iterations)
d is the damping factor, usually set to 0.85
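The iteration on these two slides can be reproduced with a short script. This is a sketch under stated assumptions: the edge list is not printed on the slides and is inferred here from the node degrees and the worked sums (in particular the edges from insurance to auto, car and coverage), and all scores are updated simultaneously from the previous round. The names textrank, edges and neighbors are illustrative.

```python
def textrank(neighbors, d=0.85, tol=1e-6, max_iter=200):
    """Iterate S(V_i) = (1 - d) + d * sum_j S(V_j) / degree(V_j) over an undirected graph."""
    scores = {v: 1.0 for v in neighbors}  # every word starts at 1
    history = []
    for _ in range(max_iter):
        new_scores = {
            v: (1 - d) + d * sum(scores[u] / len(neighbors[u]) for u in neighbors[v])
            for v in neighbors
        }
        history.append(new_scores)
        change = max(abs(new_scores[v] - scores[v]) for v in neighbors)
        scores = new_scores
        if change < tol:
            break
    return scores, history

# Assumed co-occurrence edges, inferred from the degrees and sums on the slides.
edges = [
    ("geico", "service"), ("geico", "call"), ("geico", "auto"),
    ("geico", "insurance"), ("geico", "policy"),
    ("policy", "find"), ("policy", "privacy"), ("policy", "insurance"),
    ("insurance", "auto"), ("insurance", "car"), ("insurance", "coverage"),
]
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

final, history = textrank(neighbors)
first = history[0]
for word in sorted(final, key=final.get, reverse=True):
    print(f"{word:10s} first={first[word]:.4f} final={final[word]:.4f}")
```

The slide values appear to be truncated rather than rounded, so the printed numbers can differ from the slides in the last digit.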
Approach 3 – Word2vec + ?
• Preprocessing
• No preprocessing (ideally)
• Keyword Extraction models
– Word2Vec + Clustering
Continuous Bag-of-Words Model + Negative Sampling
[Figure: one-hot input vectors (labeled 5V*1) for the context words The, cat, on, that (W(1), W(2), …, W(t-1), W(t)) are multiplied by Projection Matrix W (D*V) to give a D*1 vector; Projection Matrix W' and a softmax then score the candidate output words sits, cover, sample, input, predict, learn, believe, type, five, design, human. Softmax outputs shown: 0.366, 0.2, 0.103, 0.100, 0.009, 0.011, 0.045, 0.050, 0.070, 0.010, 0.009.]
Cost Function:
$\log p(sits \mid context = the, cat, on, that) = \log \frac{\exp(o(sits))}{\sum_{k=1}^{K} \exp(o(w_k))}, \quad w_k \neq sits$
Backpropagation:
$\delta_{output\ layer} = \hat{y} - y$
$\delta_{hidden\ layer} = \delta_{output\ layer} * W'$
Gradient Descent:
$w'_i(new) = w'_i(old) - \alpha * \delta_{output\ layer} * h_i$
$w_i(new) = w_i(old) - \alpha * \delta_{hidden\ layer}$
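The update rules on this slide can be traced in a small pure-Python sketch. Several things here are assumptions rather than the author's setup: it uses a full softmax over a five-word toy vocabulary instead of negative sampling, stores each projection matrix as one row per word, averages the context vectors to form the hidden vector, and takes the output error as prediction minus one-hot target. The names cbow_step, W and W_out are illustrative.

```python
import math
import random

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_step(W, W_out, context_ids, target_id, alpha):
    """One gradient-descent step; returns the loss measured before the update."""
    dim, vocab_size, n_ctx = len(W[0]), len(W_out), len(context_ids)
    # Hidden vector h: average of the context words' rows in W.
    h = [sum(W[c][k] for c in context_ids) / n_ctx for k in range(dim)]
    scores = [sum(W_out[j][k] * h[k] for k in range(dim)) for j in range(vocab_size)]
    y_hat = softmax(scores)
    loss = -math.log(y_hat[target_id])
    # Output-layer error: prediction minus one-hot target.
    delta_out = [y_hat[j] - (1.0 if j == target_id else 0.0) for j in range(vocab_size)]
    # Hidden-layer error: output error propagated back through W_out.
    delta_hidden = [sum(delta_out[j] * W_out[j][k] for j in range(vocab_size))
                    for k in range(dim)]
    for j in range(vocab_size):
        for k in range(dim):
            W_out[j][k] -= alpha * delta_out[j] * h[k]
    for c in context_ids:
        for k in range(dim):
            # Divided by n_ctx because h is an average of the context rows.
            W[c][k] -= alpha * delta_hidden[k] / n_ctx
    return loss

random.seed(0)
vocab = ["the", "cat", "sits", "on", "that"]
dim = 8
W = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
context_ids = [vocab.index(w) for w in ["the", "cat", "on", "that"]]
target_id = vocab.index("sits")

losses = [cbow_step(W, W_out, context_ids, target_id, alpha=0.1) for _ in range(200)]
print(f"loss before training: {losses[0]:.4f}, after 200 steps: {losses[-1]:.4f}")
```

Negative sampling would replace the sum over the whole vocabulary with the target word plus a handful of sampled words, which is what keeps the cost per step independent of the vocabulary size.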
Approach 3 – Word2vec + Clustering
• k-means
• DBSCAN
Approach 3 – Word2vec + TextRank
[Figure: Document Text (john deere compact utility tractor taylor messick Inc ...) → Trained Word2vec Model (word vectors W(1), W(2), …, W(n), an N*D matrix; company profile agricultural equipment tractor tillage mower excavator sprayer shredder agriculture harvest) → TextRank (mower excavator shredder tillage harvest sprayer)]
• Identify semantically important keywords
Approach 4 – TextRank + Word2vec
TextRank Result
Word        TextRank Score
tractor     0.015847
john        0.013281
sale        0.012494
standard    0.012474
equipment   0.010799
power       0.009747
messick     0.008162
new         0.008151
work        0.007907
series      0.007707
mower       0.006099
utility     0.006035
compact     0.005751
Word2vec Similarity (to tractor)
Word        Similarity
mower       0.8502
excavator   0.7708
shredder    0.7451
tillage     0.7341
harvest     0.7154
sprayer     0.7101
Combined Result
Word        New Score
tractor     0.015847
mower       0.015847 * 0.8502 = 0.013473
john        0.013281
sale        0.012494
standard    0.012474
excavator   0.015847 * 0.7708 = 0.012215
shredder    0.015847 * 0.7451 = 0.011808
tillage     0.015847 * 0.7341 = 0.011633
harvest     0.015847 * 0.7154 = 0.011337
sprayer     0.015847 * 0.7101 = 0.011253
equipment   0.010799
power       0.009747
messick     0.008162
new         0.008151
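The re-scoring on this slide can be sketched as follows. The scores and similarities are copied from the slide; the rule for a word that already has a TextRank score (keep the larger of the old and the new value) is an assumption made for this sketch, and expand_with_similar is a hypothetical helper name.

```python
def expand_with_similar(textrank_scores, seed, similar):
    """Score each word similar to `seed` as seed score * similarity, then re-rank."""
    new_scores = dict(textrank_scores)
    for word, similarity in similar.items():
        candidate = textrank_scores[seed] * similarity
        # Assumption: a word already in the list keeps whichever score is larger.
        new_scores[word] = max(new_scores.get(word, 0.0), candidate)
    return sorted(new_scores.items(), key=lambda item: item[1], reverse=True)

textrank_scores = {
    "tractor": 0.015847, "john": 0.013281, "sale": 0.012494,
    "standard": 0.012474, "equipment": 0.010799, "mower": 0.006099,
}
similar_to_tractor = {
    "mower": 0.8502, "excavator": 0.7708, "shredder": 0.7451,
    "tillage": 0.7341, "harvest": 0.7154, "sprayer": 0.7101,
}

ranked = expand_with_similar(textrank_scores, "tractor", similar_to_tractor)
for word, score in ranked:
    print(f"{word:10s} {score:.6f}")
```

Words that never occur on the landing page (excavator, shredder) enter the ranking this way, and a word that ranked low structurally (mower) moves up because it is semantically close to the top keyword.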
Google’s Pre-trained Word2vec
Campaign                                  Fraction of words in       Fraction of keywords in
                                          pre-trained model vocab.   pre-trained model vocab.
Geico                                     0.929985                   0.88888
Taylor Messick (Agricultural Equipment)   0.929784                   0.41176
Trane (AC)                                0.922018                   0.71428
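A coverage figure like the ones in this table is simply the share of tokens found in the model's vocabulary. The sketch below uses a made-up toy vocabulary and word list, not the campaign data, and vocab_coverage is a hypothetical helper name.

```python
def vocab_coverage(words, vocab):
    """Fraction of `words` that are present in `vocab`."""
    words = list(words)
    if not words:
        return 0.0
    return sum(1 for word in words if word in vocab) / len(words)

# Toy stand-ins for a pre-trained model's vocabulary and a page's keywords.
toy_vocab = {"tractor", "mower", "equipment", "insurance"}
toy_keywords = ["tractor", "messick", "mower", "deere"]

coverage = vocab_coverage(toy_keywords, toy_vocab)
print(f"{coverage:.2f} of the keywords are in the vocabulary")
```

Brand and dealer names are the typical misses, which would be consistent with the low keyword coverage for the agricultural-equipment campaign.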
Model Testing
1. Generate keywords from the 4 models
2. Feed them into Lucene and find URLs
3. Track the audience who visited these URLs
4. Compare the audience we find to the audience the pixels find
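Step 4 amounts to a set comparison. The sketch below is only an illustration: the user IDs are invented, and reporting the overlap as a share of the keyword-derived audience is an assumption, since the slides do not say which denominator the percentages use.

```python
def audience_overlap(found_audience, pixel_audience):
    """Count users present in both audiences and their share of the found audience."""
    found, pixel = set(found_audience), set(pixel_audience)
    shared = found & pixel
    share = len(shared) / len(found) if found else 0.0
    return len(shared), share

# Invented user IDs, for illustration only.
found_by_keywords = ["u1", "u2", "u3", "u4", "u5"]
found_by_pixel = ["u2", "u5", "u9"]

count, share = audience_overlap(found_by_keywords, found_by_pixel)
print(f"{count} shared users ({share:.0%} of the keyword-derived audience)")
```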
Results (Dell) - Keyword
TFIDF TextRank Word2vec_Textrank TextRank_Word2vec
office dell outlet dell
dellcom support collaboration acquire
view service acquire laptop
electronics product work desktop
customer price purchase software
dell use spare rebate
representative software poster welding
dellcomreturnspolicy customer transformation windows
dells system apg corporations
information practices new dell please dell software
prosupport dell dell inc poster laptop desktop
products view dell outlet apg transformation dell new
services support dell dell today purchase acquire dell tablet
dell sales dell team spare transformation dell inc
Results (Toyota) - Keyword
TFIDF TextRank Word2vec_Textrank TextRank_Word2vec
highlander toyota generate toyota
kbbcom information acquire preowned
edmundscom site misuse certified
certify vehicle tale highlander
information use govern rav
certification program tradein yaris
site email fourwheel avalon
program service generate tale corolla
assistance sale rubbed bologna sequoia
violated please toyota site identify tundra
hybrid highlander toyota vehicle wheel camry
car certification toyota dealer rubbed tale venza
personal information new toyota help toyota vehicle
cruiser preowned toyota certified new avalon preowned
Results - Urls
Dell
http://thetechjournal.com/electronics/laptop/dell-inspiron-15r-laptop.xhtml
http://www.adverts.ie/laptop-parts-and-accessories/dell-laptop-charger-19-5v-4-62a-90w/10838435
http://www.dellservicecentreinchennai.in/tablet-repair-center-medavakkam.html
http://www.dell.com/us/business/p/poweredge-c6320p/pd?oc=&model_id=poweredge-c6320p&l=en&s=bsd
http://forum.notebookreview.com/threads/dell-2012-outlet-coupons.636641/page-21
Toyota
http://www.macdonaldtoyota.ca/
http://www.stcharlestoyota.net
http://www.baldwintoyotaofpoplarbluff.com/
http://www.lafontainetoyota.com/
http://www.cedarrapidstoyota.com/
http://www.craigtoyota.com/
http://www.planettoyotaonline.com/
http://www.gatewaytoyotapierre.com/
Result - # of Converters
Result - % of Converters
CampaignId   TFIDF        TextRank     TextRank_Word2vec   Word2vec_TextRank
13405        25 (0.2%)    99 (0.8%)    44 (0.4%)           1 (0.008%)
13553        229 (3.2%)   269 (3.7%)   252 (3.5%)          8 (0.1%)
14099        6 (0.03%)    57 (0.3%)    16 (0.08%)          2 (0.01%)
14545        247 (3%)     250 (3%)     482 (5.7%)          7 (0.08%)
15077        0 (0%)       4 (0.02%)    15 (0.08%)          6 (0.03%)
Conclusion
• TextRank and TextRank_Word2vec consistently perform better than TFIDF
• TextRank doesn’t require extra space for model saving
• All 3 models need O(n) computational time
Appendix
Neural Net Language Model
[Figure: one-hot input vectors (labeled 5V*1) for the words The, cat, sits, on, that (W(1), W(2), …, W(t-1), W(t)) → Projection Matrix (D*V) → 5D*1 vector → tanh Hidden Layer → Output layer (softmax over the vocabulary, e.g. Apple, Computer, point, traffic, inbox, policy, print, couch, choice, choose, later, media). The output layer is marked "Most Computation".]
Maximize: $\frac{1}{T} \sum_{t} \log p(w_t, w_{t-1}, \ldots, w_{t-n+1}; \theta) + R(\theta)$
Time Complexity: $N * D + N * D * H + H * V$
Hierarchical Probabilistic Neural Net Language Model
[Figure: one-hot input vectors (labeled 5V*1) for the words The, cat, sits, on, that → Projection Matrix (D*V) → 5D*1 vector → tanh Hidden Layer → output words TV, Computer, couch, table, make, choose, print, write.]
Continuous Bag-of-Words Model
[Figure: one-hot input vectors (labeled 5V*1) for the context words The, cat, on, that → Projection Matrix (D*V) → D*1 vector → output words TV, Computer, couch, table, make, choose, sits, crawl.]


Editor's Notes

  1. Gensim's tfidf model takes care of normalization by document length. Sklearn's tfidf model takes care of normalization and adds a pseudocount: # log+1 instead of log makes sure terms with zero idf don't get suppressed entirely. # idf = np.log(float(n_samples) / df) + 1.0 Sklearn uses the natural log, while gensim's tfidf uses log2.
  2. 250 words, 250 vertices
  3. Both (k-means and DBSCAN) need arbitrary parameters that are hard to determine and have to be tuned. K-means doesn't cluster well with the model we trained. DBSCAN clusters better, but either throws out a lot of keywords as noise or clusters everything into 1 big group, depending on the parameters set. Still did not solve the problem with generalization.
  4. Identify keywords that are either not in the document, or structurally less important in the document but semantically close to the more important keywords. Integrate the structural importance with the semantic importance.