Paper Presentation for INF 384H (http://courses.ischool.utexas.edu/Lease_Matt/2011/Fall/INF384H/)

Based on the paper: Heymann, Koutrika, and Garcia-Molina. 2008. Can Social Bookmarking Improve Web Search?

Published in: Technology, Design

Transcript

  • 1. Can Social Bookmarking Improve Web Search? Ashish Jain Information Retrieval Paper Presentation
  • 2. Outline: 1 Introduction; 2 Terminology; 3 Collection of Data; 4 Related Work; 5 URLs: Result 1 (Positive), Result 2 (Positive), Result 3 (Positive), Result 4 (Positive), Result 5 (Positive), Result 8 (Negative), Result 9 (Negative); 6 Tags: Result 6 (Positive), Result 7 (Positive), Result 10 (Negative), Result 11 (Negative); 7 Discussion
  • 3. Introduction: What is social bookmarking? Show video (http://www.commoncraft.com/video/social-bookmarking). Ashish Jain (INF384H) Social Bookmarking Paper Presentation 3 / 51
  • 4. Introduction. Figure: Major types of data used by search engines.
  • 5. Introduction: What information does del.icio.us have? Lots of <url, tag, user> tuples. How can del.icio.us information help a search engine? If the URLs are unknown to a search engine, they can be added to the list of URLs to be crawled. Vocabulary problem: users use different words to refer to the same information. For example, a user searching for pain killers might enter the query “analgesic”.
  • 6. Introduction: Possibilities. Suppose K means known to a search engine and U means unknown to a search engine.
                  Tags (K)        Tags (U)
      URLs (K)    Both known      Tags unknown
      URLs (U)    URLs unknown    Both tags and URLs unknown
    When will del.icio.us information be useful to a search engine? (a) When the URLs in del.icio.us are not a subset of the URLs crawled by the search engine. (b) When the tags given to a particular web page are not present in the URL, title, or content of that page.
  • 7. Introduction: The authors are trying to answer the following questions: How often do we find “non-obvious” tags? Is del.icio.us really more up-to-date than a search engine? What coverage does del.icio.us have of the web?
  • 8. Terminology: Definitions.
      Triple: a <user_i, tag_j, url_k> tuple, signifying that user i has tagged URL k with tag j.
      Post: a URL bookmarked by a user, together with the associated metadata. A post is made up of many triples, though it may also contain information such as a user comment.
      Label: a <tag_i, url_k> pair, signifying that at least one triple containing tag i and URL k exists in the system.
      Host: the full host part of a URL. For example, in http://i.stanford.edu/index.html, i.stanford.edu is the host.
      Domain: the institutional-level part of the host. For example, in http://i.stanford.edu/index.html, stanford.edu is the domain.
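The definitions on slide 8 can be illustrated with a small Python sketch. This is only an illustration and not the authors' code: the function names are hypothetical, and the "last two labels" rule for the domain is a simplifying assumption that happens to work for the stanford.edu example but would be wrong for hosts such as example.co.uk.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the full host part of a URL, e.g. 'i.stanford.edu'."""
    return urlsplit(url).hostname or ""


def domain_of(url: str) -> str:
    """Approximate the institutional-level domain.

    Assumption: the domain is the last two dot-separated labels of the host.
    """
    parts = host_of(url).split(".")
    return ".".join(parts[-2:])


def labels_from_triples(triples):
    """Collapse <user, tag, url> triples into the set of <tag, url> labels."""
    return {(tag, url) for _user, tag, url in triples}


# Two users apply the same tag to the same URL: two triples, one label.
example_triples = [
    ("alice", "database", "http://i.stanford.edu/index.html"),
    ("bob", "database", "http://i.stanford.edu/index.html"),
    ("bob", "research", "http://i.stanford.edu/index.html"),
]
example_labels = labels_from_triples(example_triples)
```

The user names and tags in `example_triples` are made up for the example.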
  • 9. Collection of Data: Possible Sources. Del.icio.us interfaces: the “Recent” feed, which provides the most recent bookmarks posted to del.icio.us in real time; all posts for a given URL; all posts by a given user; the most recent posts with a given tag. Crawl: alternatively, one can crawl del.icio.us, treating it as a tripartite graph of users, URLs and tags.
  • 10. Collection of Data: Datasets.
      (C)rawl: large-scale crawl of del.icio.us in September 2006.
      (R)ecent: data gathered using the del.icio.us recent feed interface for nearly 8 months beginning September 28, 2006.
      (M)onth: data gathered from the del.icio.us recent feed interface for one complete month starting May 25, 2007. The gathering process was enhanced, so it is more accurate than the R dataset.
  • 11. Collection of Data: Comparison.
                      (C)rawl                                    (R)ecent        (M)onth
      Posts           ≈ 22M                                      ≈ 11M           ≈ 3.6M
      Unique URLs     ≈ 1.3M                                     ≈ 3M            ≈ 2.5M
      Disadvantage    Biased towards popular URLs, tags, users   Missing data    Missing data
  • 12. Collection of Data: Query Dataset. AOL query dataset: about 20 million search queries by roughly 650,000 users. Used to simulate the distribution of queries that a search engine might receive.
  • 13. URLs. Figure: Overview.
  • 14. URLs: Result 1 (Positive). Aim: Are pages posted to del.icio.us often recently modified?
  • 15. URLs: Result 1 (Positive). Methodology: Modification date of a web page. As we studied in previous papers, determining the exact modification date of a web page is hard. Search engines have to estimate the modification date of a web page in order to crawl the web efficiently. The Yahoo! Search API gives the modification date of a web page, and the authors use it to determine modification dates.
  • 16. URLs: Result 1 (Positive). Methodology: Compare. del.icio.us: pages sampled from the del.icio.us recent feed as they were posted. Yahoo! 1, 10, and 100: the top 1, 10, and 100 results (respectively) of Yahoo! searches for queries sampled from the AOL query dataset. ODP: pages sampled from the Open Directory Project (dmoz.org).
  • 17. URLs: Result 1 (Positive). Results: Pages from del.icio.us are often more recently modified than pages from ODP. There is a correlation between a search result being ranked higher and its having been modified more recently. The top 10 results from Yahoo! Search were about the same age as the pages bookmarked in del.icio.us. Conclusion: del.icio.us users post interesting pages that are actively updated or have been recently created.
  • 18. URLs: Result 2 (Positive). Aim: How many pages bookmarked in del.icio.us are not known to a search engine?
  • 19. URLs: Result 2 (Positive). Methodology: Sample pages from the del.icio.us feed as they are posted, then search for those pages immediately afterwards. About 42.5% of those pages were not found. This could be for several reasons: the page is indexed under another canonicalized URL; it could be spam; it could have an odd MIME type, for example an image; or the page may not have been found yet. Keep searching for the page over the next four weeks. If it is found later, assume it had not yet been indexed when it was posted.
  • 20. URLs: Result 2 (Positive). Result: Of the 5,724 sampled URLs that were missing, 1,750 were later found. This implies that roughly 30% of the missing URLs were new URLs, or roughly 12.5% of del.icio.us (i.e. 42.5% × 30%). Conclusion: del.icio.us can serve as a (small) data source for new web pages and can help with crawl ordering.
  • 21. URLs: Result 2 (Positive). Figure: Result 2.
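The arithmetic behind Result 2 can be checked in a few lines of Python. The three input numbers come from slides 19 and 20; the variable names are only illustrative. With unrounded intermediate values the final share comes out at about 13%, in the same range as the "roughly 12.5%" quoted on the slide.

```python
# Figures quoted on slides 19 and 20.
missing_rate = 0.425      # share of sampled del.icio.us pages not found in the search engine
missing_sampled = 5724    # missing URLs that were followed up
later_found = 1750        # of those, URLs that appeared in the index within four weeks

# Share of missing URLs that were genuinely new (found later): about 0.306.
new_among_missing = later_found / missing_sampled

# Share of all sampled del.icio.us URLs that were new to the search engine: about 0.13.
new_overall = missing_rate * new_among_missing

print(f"new among missing: {new_among_missing:.1%}")
print(f"new overall:       {new_overall:.1%}")
```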
  • 22. URLs: Result 3 (Positive). Aim: Check coverage of search results by del.icio.us.
  • 23. URLs: Result 3 (Positive). Methodology: Sample queries from the AOL dataset based on query event frequency (which means the sample is biased towards popular queries). Run each query on Yahoo! Search. Intersect the search results with datasets C, M, and R.
  • 24. URLs: Result 3 (Positive). Results: For the top 100 results, del.icio.us covers 9% of the results returned for a set of over 30,000 queries. For the top 10 results, del.icio.us covers 19% of the results returned. Conclusion: del.icio.us URLs are disproportionately common in search results compared to their coverage.
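The intersection step in the Result 3 methodology can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' pipeline: `results_by_query` is a hypothetical mapping from each sampled query to its ranked result URLs, `bookmarked_urls` stands in for the union of URLs in datasets C, M, and R, and URLs are compared as exact strings with no canonicalization.

```python
def coverage(results_by_query, bookmarked_urls, k):
    """Fraction of top-k search results (pooled over all queries) that are bookmarked."""
    hits = 0
    total = 0
    for ranked_urls in results_by_query.values():
        top_k = ranked_urls[:k]
        total += len(top_k)
        hits += sum(1 for url in top_k if url in bookmarked_urls)
    return hits / total if total else 0.0


# Toy data, invented for illustration only.
toy_results = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "a"],
}
toy_bookmarks = {"a", "d"}

coverage_top2 = coverage(toy_results, toy_bookmarks, k=2)    # 3 of 4 results bookmarked
coverage_top10 = coverage(toy_results, toy_bookmarks, k=10)  # 3 of 5 results bookmarked
```

Computing the same quantity for k = 10 and k = 100 is what allows the comparison on slide 24.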
  • 25. URLs: Result 4 (Positive). Question: Is some subset of users responsible for most of the data in del.icio.us? On social news sites, it is commonly cited that the majority of front-page posts come from a dedicated group of fewer than 100 users. del.icio.us does exhibit some of these traits, but it is not as dependent on a relatively small group of users: the top 10% of users account for only 56% of the posts.
  • 26. URLs: Result 4 (Positive). Figure: Result 4.
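A statistic like "the top 10% of users account for 56% of the posts" can be computed from a list of posts as sketched below. The function name and the input format (one user id per post) are assumptions made for this example, not something taken from the paper.

```python
import math
from collections import Counter


def top_user_share(post_users, fraction=0.10):
    """Share of all posts made by the most active `fraction` of users.

    `post_users` holds one user id per post. The number of top users is
    rounded up, so at least one user is always counted.
    """
    counts = sorted(Counter(post_users).values(), reverse=True)
    if not counts:
        return 0.0
    n_top = max(1, math.ceil(fraction * len(counts)))
    return sum(counts[:n_top]) / sum(counts)


# Toy data: ten users, one of whom wrote 11 of the 20 posts.
toy_posts = ["u0"] * 11 + [f"u{i}" for i in range(1, 10)]
toy_share = top_user_share(toy_posts, fraction=0.10)
```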
  • 27. URLs: Result 5 (Positive). How much of the information added to del.icio.us is new? Estimated using Dataset M. The URL of a new post in Dataset M was not already in del.icio.us 40% of the time. This should be about 30% after adjusting for filtering (how the authors arrived at this number is not known!). How often is a completely new domain added to del.icio.us? 12% of posts in Dataset M were URLs whose domains were not in either Dataset C or R, i.e. about 1/8 of the time.
  • 28. URLs: Result 5 (Positive). Figure: Result 5.
  • 29. URLs: Result 8 (Negative). Aim: How many URLs are posted to del.icio.us every day?
  • 30. URLs: Result 8 (Negative). Methodology: Plot the posts for every hour in Dataset M and compare them with data collected by Philipp Keller (footnote a). The two are mutually reinforcing. Also plot posts from Dataset R. Footnote a: http://deli.ckoma.net/stats (defunct website).
  • 31. URLs: Result 8 (Negative). Results: About 92,000 posts per weekend day. About 133,000 posts per weekday. This implies about 851,000 posts per week and about 44 million posts per year (footnote a). Footnote a: there are about 1.5 million blog posts per day. Conclusion: Compared to blog posts, the number of posts per day is small, about 1/10. The posting rate on del.icio.us is marked by a series of increases followed by periods of relative stability.
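The weekly and yearly totals on slide 31 follow from the two daily rates. The sketch below redoes the arithmetic from the rounded figures on the slide, so it lands slightly below the quoted 851,000 posts per week; the variable names are illustrative.

```python
# Daily posting rates quoted on slide 31 (rounded).
posts_per_weekend_day = 92_000
posts_per_weekday = 133_000
blog_posts_per_day = 1_500_000  # comparison figure from the slide's footnote

posts_per_week = 2 * posts_per_weekend_day + 5 * posts_per_weekday  # 849,000
posts_per_year = 52 * posts_per_week                                # 44,148,000

# Average daily rate relative to blog posts: roughly 0.08, i.e. on the order of 1/10.
avg_posts_per_day = posts_per_week / 7
ratio_to_blogs = avg_posts_per_day / blog_posts_per_day

print(posts_per_week, posts_per_year, round(ratio_to_blogs, 3))
```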
  • 32. URLs: Result 9 (Negative). Aim: What is the size of del.icio.us?
  • 33. URLs: Result 9 (Negative). Methodology: Divide time into three periods.
      t_1: the period before Schachter's announcement on May 24th (footnote a).
      t_2: from May 24th to the start of Philipp Keller's data gathering.
      t_3: from the start of Philipp Keller's data gathering to the present.
      t_1 + t_2 + t_3 = 400,000 + (p_1 × d_b × f) + (n_k × f + m_k × d_k × f), which equals about 117 million posts (footnote b).
    A reasonable estimate should be between 60 and 150 million posts (footnote c). An estimated 20 to 50 percent of posts are unique URLs.
      Footnote a: Joshua Schachter, creator of del.icio.us, announced in May 2004 that there were 400,000 posts and 200,000 URLs.
      Footnote b: Most likely an overestimate, as the authors chose upper-bound values for d_b and d_k.
      Footnote c: It does not include private posts.
  • 34. URLs: Result 9 (Negative). Results: There are about 115 million public posts (footnote a). There are about 30-50 million unique URLs. Footnote a: The authors estimate that there are between 60 and 150 million posts. 115 million is not the average of 60 and 150 million! Conclusion: The total number of posts is relatively small compared to the web as a whole.
  • 35. URLs: Result 9 (Negative). Figure: Result 9.
  • 36. Tags: Result 6 (Positive). Aim: Is there any correlation between tags and queries?
  • 37. Tags: Result 6 (Positive). Methodology: Check the overlap between the tags in Dataset M and the query terms in the AOL query dataset. 22% of the AOL query dataset is made up of domain or URL queries; those were removed. Certain stop-word-like tags were removed from Dataset M. Plot the number of times a tag occurs in Dataset M against the number of times it occurs in the AOL query dataset.
  • 38. Tags: Result 6 (Positive). Figure: A scatter plot of tag count versus query count for top tags and queries in del.icio.us and the AOL query dataset.
  • 39. Tags: Result 6 (Positive). Results: One of the top 100, 500, and 1000 tags occurred in 8.6%, 25.3%, and 36.8% (respectively) of these non-domain, non-URL queries. Conclusion: del.icio.us may be able to help with queries where tags overlap with query terms.
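The overlap statistic on slide 39 (the share of queries containing at least one of the top-k tags) can be sketched as follows. This is an illustration under assumptions: queries are split on whitespace and lowercased, the URL and domain queries are taken to be filtered out beforehand, and the function names and toy data are hypothetical.

```python
from collections import Counter


def top_tags(tag_counts, k):
    """Return the k most frequent tags as a set."""
    return {tag for tag, _count in Counter(tag_counts).most_common(k)}


def query_overlap(queries, tags):
    """Fraction of queries that contain at least one of the given tags as a term."""
    if not queries:
        return 0.0
    hits = sum(1 for query in queries if set(query.lower().split()) & tags)
    return hits / len(queries)


# Toy data, invented for illustration only.
toy_tag_counts = {"java": 5, "python": 4, "cool": 1}
toy_queries = ["java tutorial", "weather austin", "Python list sort", "cheap flights"]

toy_top2 = top_tags(toy_tag_counts, 2)
toy_overlap = query_overlap(toy_queries, toy_top2)
```

Running `query_overlap` with the top 100, 500, and 1000 tags in turn would produce three figures comparable to those on the slide.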
  • 40. Tags: Result 7 (Positive). Aim: Are the tags in del.icio.us of good quality, or are they nonsensical tags like “cool”, “fi32”, etc.?
  • 41. Tags: Result 7 (Positive). Methodology: User study. 10 people (graduate students and a “mix of individuals associated with our department”) manually evaluated posts to determine their quality. One post out of every five hundred was sampled, and blocks of posts were given to individuals to label. Most individuals labeled 100 to 150 posts. For each tag, evaluators were asked whether the tag was “relevant”, “applies to the whole document”, and/or “subjective”. The bar for relevance was set low: whether a random person would agree that it was reasonable to say that the tag described the page.
  • 42. Tags: Result 7 (Positive). Results: Only about 7% were deemed subjective (less than one in twenty for all users). No “spam”. Conclusion: Tags on the whole are of good quality.
  • 43. Tags: Result 10 (Negative). Aim: Do people use tags which are not obvious from the context?
  • 44. Tags: Result 10 (Negative). Methodology: Randomly pick 20,000 posts from Dataset M. Convert HTML to text. Also look at the text of pages that link to the URL in question (backlinks) and of pages that are linked from the URL in question (forward links). Extract tokens. Check whether pages are in English or not. Lowercase all tags and tokens. Compare.
  • 45. Tags: Result 10 (Negative). Results: 50% of the time the tag is in the page text. 16% of the time it is in the title itself. 20% of the time it appears in all three places: the page it annotates, at least one of its backlinks, and at least one of its forward links. 80% of the time it appears in at least one of the three places: the page, its backlinks, or its forward links. The tags in the other 20% seem to be of lower quality: misspellings, confusing tagging schemes (food/dining). Conclusion: Most tags can be discovered by a search engine.
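The comparison step in the Result 10 methodology can be sketched as below. The tokenizer is a deliberately crude assumption (lowercase runs of letters and digits), so tags containing punctuation would never match, and the HTML-to-text conversion and language check from slide 44 are assumed to have happened already. All names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase a text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def tag_locations(tag, page_text, backlink_texts, forward_texts):
    """Report where a tag appears: the page, any backlink page, any forward-link page."""
    tag = tag.lower()
    return {
        "page": tag in tokens(page_text),
        "backlinks": any(tag in tokens(text) for text in backlink_texts),
        "forward": any(tag in tokens(text) for text in forward_texts),
    }


# Toy example, invented for illustration only.
toy_found = tag_locations(
    "Java",
    page_text="Learn Java today",
    backlink_texts=["best java links"],
    forward_texts=["python docs"],
)
tag_is_discoverable = any(toy_found.values())  # appears in at least one of the three places
tag_is_everywhere = all(toy_found.values())    # appears in all three places
```

Counting `tag_is_discoverable` and `tag_is_everywhere` over a sample of posts would give figures of the same kind as the 80% and 20% on slide 45.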
  • 46. Tags: Result 11 (Negative). Aim: Are some domains strongly correlated with particular tags and vice versa?
  • 47. Tags: Result 11 (Negative). Example. Table: This example lists the five hosts in Dataset C with the most URLs annotated with the tag java.
  • 48. Tags: Result 11 (Negative). Methodology: Dataset C is used, which is highly biased towards popular URLs, tags and users. Therefore, the results of this experiment do not necessarily apply to del.icio.us as a whole. Build a simple binary classifier and see how it does. Figure: Function for classification.
  • 49. Tags: Result 11 (Negative). Result: Domains are often highly correlated with particular tags and vice versa. Conclusion: It may be more efficient to train librarians to label domains than to ask users to tag pages.
  • 50. Discussion: Summary. Advantages: actively updated; prominent in search results; tags are relevant and objective. Disadvantages: small amount of data; tags are in titles, page text, and URLs; not good enough to be used by major search engines.
  • 51. Discussion: Personalized search using del.icio.us bookmarks. I found the conclusions drawn in subsections Result 1, Result 5, and Result 11 hard to believe.
  • 52. Discussion: Reference. Heymann, Koutrika, and Garcia-Molina. 2008. Can Social Bookmarking Improve Web Search? WSDM 2008.
