sigir_faceted2006.ppt
  • Due to the Zipfian nature of the term frequency distribution, the frequency-difference function tends to favor terms that already have high frequencies in the original database. Terms with high frequencies show larger increases in frequency, even if they are less popular in the expanded database than in the original one. The inverse problem appears if we use ratios instead of differences. To avoid the shortcomings of this approach, we introduce a rank-based metric that measures the differences in the ranking of the terms.
  • To identify the candidate facet terms, we look for terms that are rather infrequent in the original database but frequent in the database of expanded documents. In particular, our algorithm proceeds as follows:

Transcript

  • 1. Automatic Discovery of Useful Facet Terms
    • Wisam Dakka – Columbia University
    • Rishabh Dayal – Columbia University
    • Panagiotis G. Ipeirotis – NYU
  • 2. Searching the NYT Archive for Book Research
  • 3. Motivation: News Archive
    • Accessing and searching is not an easy task
      • Researchers and reporters spend a large amount of time going through their long query results
      • News archives are huge and span tens of years
      • Many relevant results
        • Results on the first page are no more relevant than those on the 5th or the 10th page (NYT archive)
      • Search engines of news archive mainly follow the paradigm
        • Search, skim through long results, modify, and search again
    • Goal: Multifaceted Interfaces (MI) over the news archive of Newsblaster
    • Newsblaster archive
      • About 6 years of news from 24 news sources
      • Stories are clustered daily into hierarchies of topics and events
      • Events are threaded over time, summarized, and classified
  • 4. Motivation: MI for Newsblaster Archive
    • Our multifaceted interfaces work has some limitations [CIKM2005]:
      • Supervised learning: our algorithm can only identify facets that appear in the training set
      • WordNet hypernyms
        • WordNet has rather poor coverage of named entities
      • Free text collections
        • The quality of the hierarchies built on top of news stories was low
  • 5. Challenge: Automatic Extraction of the Useful Facets from News Archive
    • Automatically discover, in an unsupervised manner, a set of candidate facet terms from free text
    • Automatically group together facet terms that belong to the same facet
    • Build the appropriate browsing structure for each facet
  • 6. Intuition: Look for Facet Terms Elsewhere
    • Pilot study - 100 stories from The NYTimes
      • Common facets: Location, Institutes, History, People, Social Phenomenon, Markets, Nature, and Event
      • Sub-facets: Leaders under People, Corporations under Markets
    • Clear phenomenon: the terms for the useful facets do not usually appear in the news stories
      • A journalist writing a story about Jacques Chirac will not necessarily use the terms Political Leader, Europe, or France. Such missing terms are tremendously useful for identifying the appropriate facets for the story
    • We will look for these terms elsewhere
      • Terms that are infrequent in the original collection but frequent in the expanded documents
  • 7. Context-Aware Expansion
    • Example story: "Murkowski made the announcement three days after BP said it would shut down a Prudhoe Bay oil field after a small leak was found. Energy officials have said pipeline repairs are likely to take months, curtailing Alaskan production into next year"
    • [Figure: the story is expanded step by step with text from external resources. Labels: Wiki Name Entities, Yahoo Term Extractor, Wikipedia (Wiki Text), Wordnet (Wordnet Text), Google (Google Text)]
  • 8. Useful Facet Terms are Elsewhere
    • [Figure labels: Infrequent Terms, Context-aware Collection, t_i, Original Collection]
  • 9. Term Frequency Analysis
    • Frequency-based shifting
      • Due to the Zipfian nature of term frequencies, this favors terms that already have high frequencies (the inverse problem appears with ratios)
    • Rank-shifting
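
The contrast between frequency-based shifting and rank-shifting can be illustrated with a small sketch. The slides do not give the exact formulas, so the definitions below (a plain frequency difference, and a difference in rank position with absent terms ranked last) are assumptions, as are the function names and the toy counts.

```python
from collections import Counter

def ranks(freq):
    """Map each term to its rank (1 = most frequent); ties broken alphabetically."""
    ordered = sorted(freq, key=lambda t: (-freq[t], t))
    return {term: i + 1 for i, term in enumerate(ordered)}

def frequency_shift(term, original, expanded):
    """Assumed definition: frequency in the expanded collection minus the original."""
    return expanded[term] - original[term]

def rank_shift(term, original, expanded):
    """Assumed definition: rank positions gained after expansion.

    A term missing from a collection is placed one position below its last rank.
    """
    r_orig = ranks(original)
    r_exp = ranks(expanded)
    worst_orig = len(original) + 1
    worst_exp = len(expanded) + 1
    return r_orig.get(term, worst_orig) - r_exp.get(term, worst_exp)

# Toy term frequencies (invented for illustration only).
ORIGINAL = Counter({"the": 1000, "said": 400, "oil": 20, "leak": 10, "france": 2})
EXPANDED = Counter({"the": 1300, "said": 450, "france": 60, "europe": 55,
                    "oil": 30, "leak": 12})

if __name__ == "__main__":
    for term in sorted(EXPANDED):
        print(term,
              frequency_shift(term, ORIGINAL, EXPANDED),
              rank_shift(term, ORIGINAL, EXPANDED))
```

On this toy data the frequency difference is largest for "the", which was already the most frequent term, while its rank does not move at all; "france" and "europe" gain rank positions, which is the behavior a rank-based metric is meant to reward.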
  • 10. Summary: Candidate Facet Terms
    • For each document in the database, identify the important terms that are useful to characterize the contents of the document
    • For each important term in the original database, query the external resource and retrieve the terms that appear in the results. Add the retrieved terms to the original document to create an expanded, "context-aware" document
    • Analyze the frequency of the terms in both the original and the expanded database, and identify the candidate facet terms
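
The three steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the capitalized-token heuristic stands in for a real term extractor, the `RESOURCE` dictionary stands in for an external resource such as Wikipedia or WordNet, and the document-frequency gain threshold `min_gain` is a simplification of the frequency analysis (the slides argue for a rank-based comparison).

```python
import re
from collections import Counter

def important_terms(text):
    """Step 1 (stand-in heuristic): treat capitalized tokens as important terms."""
    return set(re.findall(r"\b[A-Z][a-zA-Z]+\b", text))

def expand(text, resource):
    """Step 2: append the context terms the resource returns for each important term."""
    context = []
    for term in sorted(important_terms(text)):
        context.extend(resource.get(term, []))
    return text + " " + " ".join(context)

def document_frequency(docs):
    """Count, for each lowercase token, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(re.findall(r"[a-z]+", doc.lower())))
    return df

def candidate_facet_terms(docs, resource, min_gain=2):
    """Step 3 (simplified): keep terms whose document frequency grows by min_gain or more."""
    original = document_frequency(docs)
    expanded = document_frequency(expand(d, resource) for d in docs)
    return sorted(t for t in expanded if expanded[t] - original[t] >= min_gain)

# Hypothetical lookup table standing in for queries to an external resource.
RESOURCE = {
    "Chirac": ["france", "europe", "leader"],
    "Merkel": ["germany", "europe", "leader"],
    "Paris": ["france"],
}

# Invented toy documents.
DOCS = [
    "Chirac met union officials on Tuesday.",
    "Merkel spoke about the budget.",
    "Talks in Paris ended late.",
]

if __name__ == "__main__":
    print(candidate_facet_terms(DOCS, RESOURCE))
```

None of the toy stories mentions "france", "europe", or "leader", yet each of these terms appears in two expanded documents, so they surface as candidates; "germany" appears in only one and is filtered out at the default threshold.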
  • 11. Indicative
  • 12. Research in Progress
    • Cleaning and filtering
    • Grouping similar facet terms under one facet
    • Evaluation
      • The resulting candidate terms
      • The resulting hierarchies