Automatic Web Tagging and Person Tagging Using Language Models
Qiaozhu Mei† and Yi Zhang‡
Presented by Jessica Gronski‡
†University of Illinois at Urbana-Champaign; ‡University of California at Santa Cruz
Speaker notes
  • A brief introduction to tagging: giving a brief description (usually words and phrases) to a web document. It is the dual problem of retrieval: instead of finding relevant documents for a query, we find good tags for a document.
  • Traditionally, tagging was done on the desktop. Recently it has moved online to social bookmarking sites, where other people can see what you tag and which tags you use. Example: Del.icio.us, which generates a large volume of tagging records every day.
  • Existing work on social bookmarking mainly looks at how to enhance the systems and how to utilize tags, not at how to suggest good tags, help people select tags, or automatically generate tags.
  • A good tag for a document should satisfy three criteria: meaningful (e.g., a phrase); compact (words or short phrases); relevant (it characterizes the target document, neither too general nor too narrow).
  • Assume we have a corpus of user-generated text, for example Del.icio.us data or Wikipedia.
    Find a good representation for the web document; in our work we use a multinomial distribution of words.
    Generate a set of candidate tags from the user-generated corpus, so that the candidates are meaningful and compact.
    Rank the candidate tags against the document representation, so that relevant tags come out on top.
    The basic idea of the ranking is to estimate another word distribution for each tag and compare it with the document's word distribution.
  • Three ways of choosing candidate tags.
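One of the candidate sources used later in the talk is statistically significant bigrams mined from Del.icio.us tag records. The talk does not name the significance test, so the sketch below uses a chi-square-style score against an independence baseline as one plausible choice; the `tag_sequences` data and the thresholds are illustrative, not from the paper.

```python
from collections import Counter

def significant_bigrams(tag_sequences, top_k=10, min_count=2):
    """Score adjacent tag pairs against an independence baseline and
    keep the top-ranked ones as candidate 'phrase' tags.
    (Assumed test: chi-square-style score; the talk leaves this open.)"""
    uni, bi = Counter(), Counter()
    for seq in tag_sequences:
        uni.update(seq)
        bi.update(zip(seq, seq[1:]))
    n = sum(uni.values()) or 1
    scored = []
    for (a, b), observed in bi.items():
        if observed < min_count:
            continue
        expected = uni[a] * uni[b] / n  # expected count if a, b were independent
        scored.append((" ".join((a, b)), (observed - expected) ** 2 / expected))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [bigram for bigram, _ in scored[:top_k]]

# Invented tag sequences: "data mining" recurs, the other pairs are one-offs.
seqs = [["data", "mining", "tutorial"],
        ["data", "mining", "software"],
        ["statistics", "tutorial"],
        ["data", "mining"]]
candidates = significant_bigrams(seqs)
```

Only pairs seen at least `min_count` times survive, which filters the one-off co-occurrences that would otherwise get inflated scores.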
  • We use a multinomial distribution of words to represent a web document, also known as a unigram language model. It is widely used in IR and text mining and achieves good performance. Briefly introduce what it is; from the top words we can guess the semantics underneath, so we can simply use the top words to tag a document (we use this as a baseline).
    The most important issue is estimating an LM for the document. When we have the content of the document, we can estimate from it. When we don't, we can still use the social bookmarks of that document: what other people used to tag it should reflect its characteristics.
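The bookmark-based estimation can be sketched as maximum likelihood over the pooled tags; this is a minimal sketch, and the `bookmarks` data below is invented for illustration.

```python
from collections import Counter

def estimate_unigram_lm(tag_lists):
    """Estimate p(w|d) for one document by maximum likelihood over all
    tags that users assigned to it (its social bookmarks)."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Each inner list is one user's bookmark for the same page (invented data).
bookmarks = [["text", "mining", "data"],
             ["text", "model"],
             ["text", "data"]]
lm = estimate_unigram_lm(bookmarks)
# "text" appears in 3 of the 7 pooled tags, so p(text|d) = 3/7.
```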
  • Tag ranking: we use a probabilistic approach, similar to Mei et al. 2007. Represent a document as a language model. If we can also represent the tag as another word distribution, we can score the tag by the KL divergence of t and d. But how do we estimate such a language model for a tag?
    We estimate the LM from the tags that co-occur with the candidate tag in a social bookmark collection.
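The KL-divergence scoring just described can be sketched as follows. The smoothing constant `eps` is an assumption (the talk does not discuss smoothing for words a tag's model never saw), and the toy distributions are invented.

```python
import math

def score_tag(p_d, p_t, eps=1e-6):
    """f(t,d) = -D(d||t) = sum_w p(w|d) * log( p(w|t) / p(w|d) ).
    Higher is better: the tag's model is closer to the document's.
    eps is an assumed smoothing floor for unseen words."""
    return sum(pd * math.log(max(p_t.get(w, 0.0), eps) / pd)
               for w, pd in p_d.items() if pd > 0.0)

# Invented document and tag language models.
p_d = {"data": 0.5, "mining": 0.5}
tag_lms = {"data mining": {"data": 0.4, "mining": 0.4, "software": 0.2},
           "ipod nano": {"ipod": 0.6, "nano": 0.4}}
ranked = sorted(tag_lms, key=lambda t: score_tag(p_d, tag_lms[t]), reverse=True)
# "data mining" outranks "ipod nano" for this document.
```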
  • We can rewrite the KL divergence into a simpler form. After the rewriting we have three components. The red one is the pointwise mutual information (actually its expectation under the document) of the tag and a word. The second (blue) term is the KL divergence of the document and the social bookmarking corpus; it can be viewed as a bias. Indeed, when people investigated Del.icio.us they found that most tagged documents are high-tech documents, so if we want to tag a literature page using Del.icio.us tags there is some bias. The third term can be viewed as the bias of using the tagging corpus to infer the semantics of a candidate tag.
    To simplify, we drop those two terms and use only the first (assuming no bias). The relevance of t and d then becomes the expectation of the pointwise MI of w and t given the social bookmarking corpus. This rewriting makes the algorithm efficient, because all the PMIs can be precomputed from the corpus and stored. We can further improve efficiency by dropping all PMIs that are non-positive.
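The simplified, rank-equivalent scoring can be sketched as below: precompute PMI(w,t|C) once from the corpus, keep only the positive values, then score each tag by the expectation of PMI under the document model. The corpus counts here are invented for illustration.

```python
import math
from collections import defaultdict

def precompute_pmi(cooccur, word_count, tag_count, total):
    """PMI(w,t|C) = log( p(w,t|C) / (p(w|C) * p(t|C)) ), computed once
    from the bookmarking corpus C; non-positive values are dropped,
    as the talk suggests, to save space."""
    pmi = defaultdict(dict)
    for (w, t), n_wt in cooccur.items():
        val = math.log((n_wt / total) /
                       ((word_count[w] / total) * (tag_count[t] / total)))
        if val > 0:
            pmi[t][w] = val
    return pmi

def expected_pmi_score(p_d, pmi_t):
    """Rank-equivalent score: E_d[ PMI(w,t|C) ] = sum_w p(w|d) * PMI(w,t|C)."""
    return sum(pw * pmi_t.get(w, 0.0) for w, pw in p_d.items())

# Invented corpus statistics.
cooccur = {("data", "data mining"): 8, ("mining", "data mining"): 6,
           ("ipod", "ipod nano"): 5}
word_count = {"data": 10, "mining": 8, "ipod": 6}
tag_count = {"data mining": 10, "ipod nano": 6}
pmi = precompute_pmi(cooccur, word_count, tag_count, total=40)

p_d = {"data": 0.6, "mining": 0.4}
best = max(pmi, key=lambda t: expected_pmi_score(p_d, pmi[t]))
```

Because the document never mentions "ipod", the tag "ipod nano" gets a zero expected PMI and "data mining" wins.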
  • Similar to tagging documents, we can tag a user in order to summarize his interests and the bias of his view (e.g., a technical view? a political view?).
  • These four slides are a little trickier; please feel free to ask if anything is unclear. We have four strategies to generate tags: 1. do not use the KL-divergence ranking, simply use the words with the highest p(w|d); 2. KL ranking + single tags from Del.icio.us; 3. KL ranking + statistically significant bigrams from Del.icio.us; 4. KL ranking + Wikipedia titles. So 1 is the baseline, and 2, 3, 4 vary in the selection of candidates.
    I intend to show that with 1, many terms are too general to capture the characteristics of the documents, and some only partially cover the semantics (blue); basically they are not very relevant. With 2, all tags are relevant (red), but some are not very meaningful, or are ambiguous (black), for example color, cor, and ajax. With 3, all are relevant and more meaningful than with 2 (red), but because we use a statistical significance test to select tags, some are fake phrases (purple), e.g., xml youtube, color color, etc. With 4, all tags are relevant, meaningful, and real phrases, like web color, internet video, research video; but because they come from an outside corpus there is a vocabulary gap, and we miss quite a few good tags.
    In general, red tags are good ones and the other colors are not so good.
  • Similar findings, but with different examples.
  • With the LM baseline, some tags like "humor" only partially cover the semantics. With ranking + bigrams, some are fake, funny phrases, like "humor programming". With Wikipedia titles, many are better. We can see that the second user has two very different aspects of interest: network programming and geek humor. This shows that we can effectively tag users as well as documents.
  • Similar, but with different examples. Phrases in green are not real phrases, because Del.icio.us does not preserve the order in which a user entered the tags; if we switched the order, they would become much better tags. We can see that many bigram tags for user 4 are good, but those tags do not appear in Wikipedia, so we lose them if we use an outside source like Wikipedia.
  • Basically a summary of the four previous slides. The basic message is that the method can effectively generate tags for documents and users. Ranking tags by labeling language models with KL divergence is much better than the baseline, but there is no unified story on candidate tag selection: all three sources have pros and cons. Bigrams and Wikipedia titles are better than single words.
Transcript

    1. Automatic Web Tagging and Person Tagging Using Language Models - Qiaozhu Mei†, Yi Zhang‡. Presented by Jessica Gronski‡. †University of Illinois at Urbana-Champaign; ‡University of California at Santa Cruz
    2. Tagging a Web Document • The dual problem of search/retrieval [Mei et al. 2007]: – Retrieval: short description (query) → relevant documents – Tagging: document → short description (tag) • To summarize the content of documents • To access the document in the future (Diagram: text document ↔ query/tag, connected by retrieval and tagging.)
    3. Social Bookmarking of Web Documents (Diagram: web documents linked to their social bookmarks (tags).)
    4. Existing Work on Social Bookmarking • Social bookmarking systems – Del.icio.us, Digg, CiteULike, etc. • Enhance social bookmarking systems – Anti-spam [Koutrika et al. 2007] – Search & ranking tags [Hotho et al. 2006] • Utilize social bookmarks – Visualization [Dubinko et al. 2006] – Summarization [Boydell et al. 2007] – Use tags to help web search [Heymann et al. 2008]; [Zhou et al. 2008]
    5. Research Questions • Can we automatically generate tags for web documents? – Meaningful, compact, relevant • Can we generate tags for other web objects, such as web users?
    6. Applications of Automatic Tagging • Summarizing documents/web objects • Suggesting social bookmarks • Refining queries for web search – Finding good queries for a document • Suggesting good keywords for online advertising
    7. Rest of the Talk • A probabilistic approach to tag generation – Candidate tag selection – Web document representation – Tag ranking • Experiments – Web document tagging – Web user tagging • Summary
    8. Our Method
       Web documents → multinomial word distribution representation, e.g.: data 0.1599, statistics 0.0752, tutorial 0.0660, analysis 0.0372, software 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, algorithm 0.0173, …
       Candidate tag pool (from a user-generated corpus, e.g., Del.icio.us, Wikipedia): ipod nano, data mining, presidential campaign, index structure, statistics tutorial, computer science, …
       Ranking candidate tags: data mining 0.26, statistics tutorial 0.19, computer science 0.17, index structure 0.01, …, ipod nano 0.00001, presidential campaign 0.0, …
    9. Candidate Tag Selection • Meaningful, compact, user-oriented • From social bookmarking data – E.g., Del.icio.us – Single tags → tags that other people used – "Phrases" → statistically significant bigrams • From other user-generated web content – E.g., Wikipedia – Titles of Wikipedia entries
    10. Representation of Web Documents • Multinomial distribution of words (unigram language models) – Commonly used in retrieval and text mining • Can be estimated from the content of the document, or from social bookmarks (our approach) – What other people used to tag that document. Example: text 0.16, mining 0.08, data 0.07, probabilistic 0.04, independence 0.03, model 0.03, … Baseline: use the top words in that distribution to tag a document.
    11. Tag Ranking: A Probabilistic Approach • Web document d → a language model • A candidate tag t → a language model p(w|t,C) estimated from its co-occurring tags in a social bookmark collection C • Score and rank t by the KL divergence of these two language models: f(t,d) = −D(d||t) = Σ_w p(w|d) log [ p(w|t) / p(w|d) ]
    12. Rewriting and Efficient Computation
       f(t,d) = Σ_w p(w|d) log [ p(w|t) / p(w|d) ]
              = Σ_w p(w|d) log [ p(w|t,C) / p(w|d) ] + Σ_w p(w|d) log [ p(w|t) / p(w|t,C) ]
              = Σ_w p(w|d) log [ p(w,t|C) / (p(w|C) p(t|C)) ] − Σ_w p(w|d) log [ p(w|d) / p(w|C) ] + Bias(t,C)
              = E_d[ PMI(w,t|C) ] − D(d||C) + Bias(t,C)
       The first term is the expected pointwise mutual information of w and t in C under the document model; D(d||C) is the bias of using C (e.g., Del.icio.us) to represent document d; Bias(t,C) is the bias of using C to represent candidate tag t. Dropping the two bias terms, f(t,d) is rank-equivalent to E_d[ PMI(w,t|C) ]. 1. The PMIs can be pre-computed from the corpus; 2. Only those PMI(w,t|C) > 0 need be stored.
    13. Tagging Web Users • Summarize the interests and bias of a user • Web user → a pseudo-document • Estimate a language model from all the tags he used • The rest is similar to web document tagging
    14. Experiments • Dataset: – Two-week tagging records from Del.icio.us – Candidate tags: • Top 15,000 significant 2-grams from Del.icio.us • Titles of all Wikipedia entries (5,836,166 entries; around 48,000 appeared in Del.icio.us)
       Time span: 02/13/07 to 02/26/07 | Bookmarks: 579,652 | Distinct tags: 111,381 | Distinct users: 20,138
    15. Tagging Web Documents
       http://kuler.adobe.com/ (158 bookmarks)
       – LM p(w|d): color design webdesign tools adobe graphics flash
       – Tag = word: color colour palette colorscheme colours picker cor
       – Tag = bigram: adobe color, color design, color colour, color colors, colour design, inspiration palette, webdesign color
       – Tag = wikipedia title: color, colour, palette, web color, colours, cor, rgb
       http://www.youtube.com/watch?v=6gmP4nk0EOE (157 bookmarks)
       – LM p(w|d): web2.0 video youtube web internet xml community
       – Tag = word: youtube revver vodcast primer comunidad participation ethnograpy
       – Tag = bigram: xml youtube, web2.0 youtube, video web2.0, web2.0 xml, online presentation, social video, youtube video
       – Tag = wikipedia title: internet video, youtube, revver, research video, vodcast, primer, p2p TV
       Callouts: LM words are too general, sometimes not relevant; single-word tags are relevant and precise, but sometimes not meaningful; bigram tags are meaningful and relevant, but overfit the data and are not always real phrases; Wikipedia titles are meaningful, relevant, and real, but only partially cover the good tags.
    16. Tagging Web Documents (Cont.)
       http://pipes.yahoo.com (386 bookmarks)
       – LM p(w|d): yahoo rss web2.0 mashup feeds programming pipes
       – Tag = word: pipes feeds yahoo mashup rss syndication mashups
       – Tag = bigram: feeds mashups, mashup pipes, web2.0 yahoo, rss web2.0, mashup rss, api feeds, pipes programming
       – Tag = wikipedia title: pipes, yahoo, mashups, rss, syndication, mashups, blog, feeds
       http://www.miniajax.com/ (349 bookmarks)
       – LM p(w|d): ajax javascript web2.0 webdesign programming code webdev
       – Tag = word: ajax dhtml javascript moo.fx dragdrop phototype autosuggest
       – Tag = bigram: ajax code, code javascript, javascript ajax, javascript web2.0, css ajax, javascript programming
       – Tag = wikipedia title: ajax, dhtml, javascript, moo.fx, javascript library, javascript-framework
       Callouts: LM words are too general, sometimes not relevant; single-word tags are relevant and precise, but sometimes not meaningful; bigram tags are meaningful and relevant, but overfit the data and are not always real phrases; Wikipedia titles are meaningful, relevant, and real.
    17. Tagging Web Users
       User 1
       – LM p(w|d): photography art portraits tools web design geek
       – Tag = bigram: art photography, photography portraits, digital flickr, photoblog photography, art photo, flickr photography, weblog wordpress
       – Tag = wikipedia title: art photography, photoblog, portraits, photography landscapes, flickr, art contest
       User 2
       – LM p(w|d): humor programming photography blog webdesign security funny
       – Tag = bigram: geek hack, humor programming, hack hacking, networking programming, geek html, geek hacking, reference security
       – Tag = wikipedia title: network programming, tweak, hacking, security, geek humor, sysadmin, digitalcamera
       Callouts: LM words only partially cover the interests; bigram tags overfit the data and are not always real phrases; Wikipedia titles are meaningful, relevant, and real.
    18. Tagging Web Users (Cont.)
       User 3
       – LM p(w|d): games arg tools programming sudoku cryptography software
       – Tag = bigram: arg games, games puzzles, games internet, arg code, games sudoku, code generator, community games
       – Tag = wikipedia title: arg games, research games, puzzles, storytelling, code generator, community games
       User 4
       – LM p(w|d): web reference css development rubyonrails tools design
       – Tag = bigram: rubyonrails web, css development, brower development, development editor, development forum, development firefox
       – Tag = wikipedia title: javascript, tools, javascript css, webdev, xhtml, dhtml, css3, dom
       Callout: the Wikipedia-title candidates missed many good tags.
    19. Discussions • Using top tags: too general, sometimes not relevant • Ranking tags by labeling language models: – Candidate = social bookmarking words • Pros: relevant, compact • Cons: ambiguous, not so meaningful – Candidate = social bookmarking bigrams • Pros: more meaningful, relevant • Cons: overfitting the data, sometimes not real phrases – Candidate = Wikipedia titles • Pros: meaningful, relevant, real phrases • Cons: biased, misses potentially good tags (Bias(t,C))
    20. Summary • Automatic tagging of web documents and web users • A probabilistic approach based on labeling language models • Effective when the candidate tags are of high quality • Future work: – A robust way of generating candidate tags – Large-scale evaluation
    21. Thanks!
