After an overview of the issues involved in placing advertising messages in Web pages, the seminar presents some innovative text-analysis techniques. In particular, these techniques focus on suggesting ads that are relevant to the context of the page being viewed.
19. A MODERN CONTEXTUAL ADVERTISING SYSTEM
Eloisa Vargiu (evargiu@bdigital.org) – Cagliari, 6 September 2012
20. SYNTACTIC TEXTUAL ANALYSIS
Text Summarization
Bag of Words Representation
21. SYNTACTIC TEXTUAL ANALYSIS
Text summarization
State of the art techniques
First and Last Paragraph (FLP)
Title, First and Last Paragraph (TFLP)
Snippet (S)
Title and Snippet (TS)
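The four techniques listed above can be sketched as a single function. This is only a minimal illustration: the function name, its parameters, and the idea that the title, paragraphs and snippet have already been extracted from the page are assumptions, not part of the original system.

```python
def summarize(title, paragraphs, snippet, technique):
    """Build a page summary with FLP, TFLP, S or TS (hypothetical helper)."""
    # First and last paragraph; a single-paragraph page contributes it once.
    first_last = paragraphs[:1] + (paragraphs[-1:] if len(paragraphs) > 1 else [])
    parts = {
        "FLP": first_last,
        "TFLP": [title] + first_last,
        "S": [snippet],            # snippet is assumed to be supplied by the caller
        "TS": [title, snippet],
    }[technique]
    return " ".join(parts)
```

For example, summarize("T", ["a", "b", "c"], "s", "TFLP") joins the title with the first and last paragraphs and ignores the middle one.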
22. SYNTACTIC TEXTUAL ANALYSIS
First and Last Paragraph (FLP)
You don’t need to shell out thousands,
survive various ballots, or swap a family
member for a ticket to enjoy the 2012
Summer Olympic Games this year. There's
all manner of free events and associated
shenanigans taking place in London and
across the UK to mark the occasion. Here
are ten ways to join in without spending any
money.
Indulge in a family feast
Volunteer chefs at 24 Sure Start Centres
across the UK are preparing to dish up free
delights throughout the period. Details,
along with all the other events that make up
the Cultural Olympiad, are available on the
site.
http://www.roughguides.com/website/Travel/SpotLight/ViewSpotLight.aspx?spotLightID=575
24. SYNTACTIC TEXTUAL ANALYSIS
Title, First and Last Paragraph (TFLP)
London 2012 – Ten ways to celebrate the Olympics for free
You don’t need to shell out thousands,
survive various ballots, or swap a family
member for a ticket to enjoy the 2012
Summer Olympic Games this year. There's
all manner of free events and associated
shenanigans taking place in London and
across the UK to mark the occasion. Here
are ten ways to join in without spending any
money.
Indulge in a family feast
Volunteer chefs at 24 Sure Start Centres
across the UK are preparing to dish up free
delights throughout the period. Details,
along with all the other events that make up
the Cultural Olympiad, are available on the
site.
http://www.roughguides.com/website/Travel/SpotLight/ViewSpotLight.aspx?spotLightID=575
25. SYNTACTIC TEXTUAL ANALYSIS
Snippet (S)
http://www.roughguides.com/website/Travel/SpotLight/ViewSpotLight.aspx?spotLightID=575
26. SYNTACTIC TEXTUAL ANALYSIS
Title and Snippet (TS)
http://www.roughguides.com/website/Travel/SpotLight/ViewSpotLight.aspx?spotLightID=575
27. SYNTACTIC TEXTUAL ANALYSIS
Bag of Words (BoW) representation
Dimensionality reduction
Stop-words removal
Stemming
Vector representation
Set of pairs <word, occurrences>
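The steps above (stop-words removal, stemming, and the final set of <word, occurrences> pairs) can be sketched as follows. The stop-word list is a tiny illustrative sample and naive_stem is a crude suffix stripper used only as a stand-in for a real stemmer; both are assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "all", "and", "any", "are", "don't", "for", "here", "in",
              "need", "of", "or", "out", "the", "there's", "this", "to",
              "without", "you"}

def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Return the <word, occurrences> pairs as a Counter."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    kept = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(kept)
```

Calling bag_of_words on a summary yields a sparse vector in which each key is a stemmed term and each value is its number of occurrences.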
28. SYNTACTIC TEXTUAL ANALYSIS
Stop-words removal
You don’t need to shell out thousands,
survive various ballots, or swap a
family member for a ticket to enjoy the
2012 Summer Olympic Games this
year. There's all manner of free events
and associated shenanigans taking
place in London and across the UK to
mark the occasion. Here are ten ways
to join in without spending any money.
29. SYNTACTIC TEXTUAL ANALYSIS
Stop-words removal
Stop words marked for removal (in brackets):
[You] [don’t] [need] [to] shell [out] thousands,
survive various ballots, [or] swap [a]
family member [for] [a] ticket [to] enjoy [the]
2012 Summer Olympic Games [this]
year. [There's] [all] manner [of] free events
[and] associated shenanigans taking
place [in] London [and] across [the] UK [to]
mark [the] occasion. [Here] [are] ten ways
[to] join [in] [without] spending [any] money.
30. SYNTACTIC TEXTUAL ANALYSIS
Stop-words removal
Result after removing the stop words:
Shell thousands, survive various
ballots, swap family member ticket
enjoy 2012 Summer Olympic Games
year. Manner free events associated
shenanigans taking place London
across UK mark occasion. ten ways
join spending money.
31. SYNTACTIC TEXTUAL ANALYSIS
Stemming
Shell thousands, survive various
ballots, swap family member ticket
enjoy 2012 Summer Olympic Games
year. Manner free events associated
shenanigans taking place London
across UK mark occasion. ten ways
join spending money.
32. SYNTACTIC TEXTUAL ANALYSIS
Stemming
Suffixes marked for stripping (in brackets):
Shell thousand[s], surviv[e] various
ballot[s], swap famil[y] member ticket
enjoy 2012 Summer Olymp[ic] Game[s]
year. Manner free event[s] associat[ed]
shenanigan[s] tak[ing] place London
across UK mark occasion. ten way[s]
join spend[ing] money.
33. SYNTACTIC TEXTUAL ANALYSIS
Stemming
Result after stemming:
Shell thousand, surviv various ballot,
swap famil member ticket enjoy 2012
Summer Olymp Game year. Manner
free event associat shenanigan tak
place London across UK mark
occasion. ten way join spend money.
35. IS THE SYNTACTIC APPROACH ALONE ENOUGH?
Polysemy…
“BASS”
37. IS THE SYNTACTIC APPROACH ALONE ENOUGH?
Synonymy…
Vehicle, Machine, Car, Auto, Automobile
38. SEMANTIC TEXTUAL ANALYSIS
Taxonomy-based Classification
Word Disambiguation
40. SEMANTIC TEXTUAL ANALYSIS
Rocchio
Each centroid is defined as a sum of TF-IDF values of each
term, normalized by the number of webpages in the class
The classification is based on the
cosine of the angle between the
webpage and the centroid of each class
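A minimal sketch of this classification scheme is given below. It assumes that webpages are already represented as sparse TF-IDF vectors (dictionaries from term to weight); the function names and the toy class labels used in the usage note are hypothetical.

```python
import math
from collections import defaultdict

def centroid(class_vectors):
    """Sum the TF-IDF weight of each term, then divide by the number of pages."""
    total = defaultdict(float)
    for vec in class_vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: w / len(class_vectors) for term, w in total.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(page, centroids):
    """Pick the class whose centroid is closest in angle to the page."""
    return max(centroids, key=lambda label: cosine(page, centroids[label]))
```

With one centroid per taxonomy node, for instance "sport" and "finance", classify returns the label with the highest cosine score for the given page.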
41. SEMANTIC TEXTUAL ANALYSIS
SVM
The score is related to the
distance of the webpage from a
separation hyperplane
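As an illustration of such a score, the sketch below computes the signed distance of a point from a hyperplane w·x + b = 0. It assumes a linear SVM whose weights w and bias b have already been learned; the function name and the sample values are hypothetical.

```python
import math

def svm_score(x, w, b):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b
    return margin / math.sqrt(sum(wi * wi for wi in w))
```

A positive score places the webpage on the side of the class, a negative score on the opposite side, and the magnitude tells how far it lies from the boundary.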
42. SEMANTIC TEXTUAL ANALYSIS
Word Disambiguation
Bag of Concepts (BoC) representation
Adopted lexical supports
WordNet
YAGO
ConceptNet
43. SEMANTIC TEXTUAL ANALYSIS
WordNet
A large lexical database
of English. Nouns, verbs,
adjectives and adverbs
are grouped into sets of
cognitive synonyms
(synsets), each
expressing a distinct
concept.
44. SEMANTIC TEXTUAL ANALYSIS
YAGO
A semantic knowledge base, derived from Wikipedia,
WordNet and GeoNames
45. SEMANTIC TEXTUAL ANALYSIS
ConceptNet
A network of concepts connected by several semantic
relations (e.g., “IsA”, “PartOf”)
48. MATCHING
o Cosine similarity
49. MATCHING
o Jaccard index
The Jaccard coefficient measures similarity between sample sets,
and is defined as the size of the intersection divided by the size of
the union of the sample sets
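The definition above translates directly into code. In this sketch the page and the ad are assumed to be given as collections of terms, and returning 0.0 when both sets are empty is a convention chosen for the example.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For a page with terms {olympic, london, ticket} and an ad with terms {london, ticket, hotel, flight}, two terms are shared out of five distinct ones, so the index is 2/5.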
50. MATCHING
Ranking
Adopted approaches
Simple ranking according to the calculated scores
Learning to rank model
51. MATCHING
o Learning to rank model
Pointwise approach
o Each query-document pair in the training data has a numerical
or ordinal score
o Regression problem approach: given a single query-document
pair, predict its score
Pairwise approach
o Classification problem approach: learning a binary classifier
which can tell which document is better in a given pair of
documents
Listwise approach
o Optimization problem approach: try to directly optimize the
value of one of the above evaluation measures, averaged over
all queries in the training data
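To make the pairwise approach more concrete, the sketch below turns scored query-document pairs into binary training examples. Representing a document pair by the difference of the two feature vectors is one common choice and an assumption here, as are the function name and the data layout.

```python
from itertools import combinations

def pairwise_examples(judged):
    """judged: list of (features, score) for the documents of one query.

    Returns one (feature_difference, label) example per document pair,
    with label 1 when the first document of the pair is the better one.
    """
    examples = []
    for (fa, sa), (fb, sb) in combinations(judged, 2):
        if sa != sb:  # ties carry no preference, so they are skipped
            diff = [x - y for x, y in zip(fa, fb)]
            examples.append((diff, 1 if sa > sb else 0))
    return examples
```

Any binary classifier can then be trained on these examples. The pointwise approach would instead use the (features, score) pairs as they are, as regression targets.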
57. SYNTACTICAL ANALYSIS
Results
Technique       FLP     TFLP    S       TS
π (precision)   0.745   0.832   0.734   0.806
ρ (recall)      0.719   0.801   0.730   0.804
F1              0.732   0.816   0.732   0.805
#t              24      26      12      14
Adding the title to the summary improves performance
TFLP has the best performance
58. SEMANTIC ANALYSIS
Semantic approaches comparison
Anagnostopoulos et al. (2007) system vs Armano et al.
(2011-TIR) vs ConCA
Matching function
\[ \phi(p, a) = \alpha \cdot \mathrm{sim}_{BoC}(p, a) + (1 - \alpha) \cdot \mathrm{sim}_{CF}(p, a) \]
Comparison metric
\[ \pi@k = \frac{\sum_{i=1}^{N} \sum_{j=1}^{k} TP_{ij}}{\sum_{i=1}^{N} \sum_{j=1}^{k} (TP_{ij} + FP_{ij})} \]
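The comparison metric, the precision over the top k ads suggested for each of N pages, can be sketched as follows. The input layout (one list of relevance judgments per page, ordered by rank) and the function name are assumptions made for this example.

```python
def precision_at_k(relevance, k):
    """relevance[i][j] is True when the j-th ad suggested for page i is relevant."""
    # True positives among the top k suggestions of every page.
    tp = sum(sum(1 for r in row[:k] if r) for row in relevance)
    # Every suggested ad is either a true or a false positive.
    suggested = sum(len(row[:k]) for row in relevance)
    return tp / suggested if suggested else 0.0
```

With two pages judged as [True, False, True] and [True, True, False], three of the four ads in the top 2 are relevant, so the precision at k = 2 is 0.75.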
59. SEMANTIC ANALYSIS
Ad repository
Built by hand by a domain expert
Taxonomy
BankSearch Dataset
62. SEMANTIC ANALYSIS
Results
Slight improvement by using concepts
Low values of α → CF has more impact than BoC
63. SYNTACTICAL ANALYSIS VS SEMANTIC ANALYSIS
Contextual Advertising System
Armano et al. (2011-TIR)
Matching function
\[ \phi(p, a) = \alpha \cdot \mathrm{sim}_{BoW}(p, a) + (1 - \alpha) \cdot \mathrm{sim}_{CF}(p, a) \]
Comparisons varying α
α = 1 → pure syntax
α = 0 → pure semantics
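The role of α can be illustrated with a small sketch of the linear combination. The two similarity values are assumed to be computed elsewhere, and the function name is hypothetical.

```python
def match_score(sim_bow, sim_cf, alpha):
    """Blend the syntactic (BoW) and semantic (CF) similarities."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_bow + (1.0 - alpha) * sim_cf
```

With alpha = 1 only the BoW similarity contributes, with alpha = 0 only the CF similarity does, and intermediate values trade one against the other.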
Comparison metric
\[ \pi@k = \frac{\sum_{i=1}^{N} \sum_{j=1}^{k} TP_{ij}}{\sum_{i=1}^{N} \sum_{j=1}^{k} (TP_{ij} + FP_{ij})} \]
64. SYNTACTICAL ANALYSIS VS SEMANTIC ANALYSIS
Ad repository
Built by hand by a domain expert
Taxonomy
BankSearch Dataset
67. CONCLUSIONS
Online advertising
represents one of the major sources of income for a large
number of websites
is aimed at suggesting products and services to the
population of Internet users
Modern contextual advertising systems
put ads within the content of a generic third-party
webpage
adopt both syntactical and semantic textual analyses to
select the most relevant ads for a given webpage
an example is ConCA
68. CONCLUSIONS
Results show that
the impact of semantics is stronger than that of syntax
adopting more advanced semantic techniques, such as
concepts, improves performance
the more ads are suggested, the worse the
performance
70. REFERENCES
Syntactical Textual Analysis
Armano G., Giuliani A. & Vargiu E. Experimenting text summarization
techniques for contextual advertising. 2nd Italian Information Retrieval
Workshop (IIR’11), 2011.
Armano G., Giuliani A. & Vargiu E. Using snippets in text summarization: a
comparative study and an application. 3rd Italian Information Retrieval
Workshop (IIR’12), 2012.
Kolcz A., Prabakarmurthi V. & Kalita J. Summarization as feature selection for
text categorization. 10th International Conference on Information and
Knowledge Management (CIKM’01). ACM, New York, NY, USA, pp. 365–370,
2001.
Porter M. An algorithm for suffix stripping. Program 14, 3, 130–137, 1980.
Salton G., Wong A. & Yang C.S. A vector space model for automatic indexing.
Communications of the ACM, 18, 11, pp. 613–620, 1975.
71. REFERENCES
Semantic Textual Analysis
Cortes C. & Vapnik V.N. Support-vector networks. Machine Learning, 20,
1995.
Fellbaum C. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT
Press, 1998.
Liu H. & Singh P. ConceptNet: A practical commonsense reasoning tool-kit. BT
Technology Journal 22, pp. 211–226, 2004.
Miller G.A. WordNet: A Lexical Database for English. Communications of the
ACM, 38, 11, pp. 39-41, 1995.
Rocchio J. Relevance feedback in information retrieval. In: The SMART
Retrieval System: Experiments in Automatic Document Processing.
Prentice-Hall, pp. 313–323, 1971.
Suchanek F.M., Kasneci G. & Weikum G. Yago - A Core of Semantic
Knowledge. 16th International World Wide Web conference (WWW 2007),
2007.
72. REFERENCES
Matching
Liu T.Y. Learning to rank for information retrieval. Found. Trends Inf. Retr. 3, 3,
pp. 225–331, 2009.
Radomski P.J. & Goeman T.J. The homogenizing of Minnesota lake fish
assemblages. Fisheries, 20, pp. 20–23, 1995.
Comparison Systems
Anagnostopoulos A., Broder A. Z., Gabrilovich E., Josifovski V. & Riedel L. Just-
in-time contextual advertising. 16th ACM Conference on Information and
Knowledge Management (CIKM’07). ACM, New York, NY, USA, pp. 331–340,
2007.
Armano G., Giuliani A. & Vargiu E. Studying the impact of text summarization
on contextual advertising. 8th International Workshop on Text-based
Information Retrieval (TIR’11), 2011.
Armano G., Giuliani A. & Vargiu E. Semantic enrichment of contextual
advertising by using concepts. International Conference on Knowledge
Discovery and Information Retrieval, 2011.