Web Information Retrieval and Mining


    Web Information Retrieval and Mining - Presentation Transcript

    1. Web Retrieval and Mining Overview. Source: Ricardo Baeza-Yates and Carlos Castillo: “Web Retrieval and Mining”. Entry in “Encyclopedia of Library and Information Sciences”, third edition (to appear in 2009).
    2. Information Retrieval
      • Methods for finding information in documents
        • Started in the 1970s and 1980s
      • “Methods”
        • Algorithms and heuristics
      • “Finding”
        • Query – Document, Document – Document, etc.
      • “Documents”
        • Texts
    3. The Web is different
      • Massive
        • Thousands of millions of documents
      • Dynamic
        • Updates
        • Deletes
      • Distributed
        • Variable quality
        • Malicious behavior
    4. Web IR topics
      • Web Search
        • Crawling
        • Indexing
        • Querying
      • Web Mining
      • Adversarial Web IR
      • Distributed Web IR
      • Evaluation
    5. Web search
    6. Main goals
      • Precision
        • Relevant documents returned / Documents returned
      • Recall
        • Relevant documents returned / Relevant documents
      • Freshness
      • Performance/scalability
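The two ratios on slide 6 can be computed directly from sets of document identifiers. The following is a minimal sketch; the function name `precision_recall` and the sample document ids are illustrative assumptions, not part of the talk.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    returned: document ids the search engine returned
    relevant: document ids judged relevant for the query
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents returned
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 relevant overall, 2 in common.
p, r = precision_recall(returned=["d1", "d2", "d3", "d4"],
                        relevant=["d2", "d4", "d7"])
print(p, r)
```

Returning more documents can only keep or raise recall but tends to lower precision, which is why the two measures are usually reported together.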
    7. Main goals
    8. Two phases of search
      • Off-line
        • Crawling and indexing
      • On-line
        • Querying and ranking
    9. Search phases
    10. Web crawling
      • Download pages following rules
      • Applications
        • Create index for search
        • Find particular information items
        • Find/report problems
      • Constraints
        • Robot exclusion protocol and politeness
        • Deep web
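As a rough illustration of the robot exclusion protocol and politeness constraints on slide 10, a crawler can consult a site's robots.txt before each download. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the agent name `example-crawler`, and the URLs are made up for the example, and the file is parsed from a string so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="example-crawler"):
    """True if the exclusion rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


# Politeness: seconds to wait between requests to this host, if the site states one.
delay = parser.crawl_delay("example-crawler")

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
print(delay)
```

A real crawler would also keep a per-host queue and sleep for at least the stated delay between requests to the same server.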
    11. Web indexing
      • Logical view
        • Tokenization
        • Stopword removal
        • Stemming
      • Creation of an inverted index
    12. Inverted index
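The steps on slides 11 and 12 (tokenization, stopword removal, stemming, then building the inverted index) can be sketched as below. This is a toy under stated assumptions: the stopword list is a tiny illustrative set, and `stem` is a naive suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}  # tiny illustrative list


def stem(token):
    """Naive suffix stripping; an assumption, not a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def logical_view(text):
    """Tokenize, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(logical_view(docs[doc_id])):
            index[term].append(doc_id)
    return dict(index)


docs = {
    1: "Crawling the Web",
    2: "Indexing and querying the Web",
    3: "Web crawlers download pages",
}
index = build_inverted_index(docs)
print(index["web"])    # posting list for the term "web"
print(index["crawl"])  # "crawling" was stemmed to "crawl"
```

Because documents are processed in increasing id order, every posting list comes out sorted, which is what makes list intersection and gap-based compression straightforward.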
    13. Challenges of indexing
      • Index compression
      • Efficiency in top-K searches
        • Sorting
      • Index distribution
        • By terms
        • By documents
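One common approach to the index compression mentioned on slide 13 is to store the gaps between consecutive document ids instead of the ids themselves, and to encode each gap with a variable number of bytes. The talk does not say which scheme it has in mind, so the sketch below is only one possibility; the function names are assumptions.

```python
def vbyte_encode(numbers):
    """Variable-byte encoding: 7 data bits per byte, high bit set on the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()       # most significant 7-bit group first
        chunk[-1] |= 0x80     # flag the final byte of this number
        out.extend(chunk)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:       # last byte of the current number
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers


def compress_postings(doc_ids):
    """Gap-encode a sorted posting list, then variable-byte encode the gaps."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the original document ids by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [824, 829, 215406]
packed = compress_postings(postings)
print(len(packed), decompress_postings(packed))
```

Small gaps fit in a single byte, so long posting lists for frequent terms, where gaps are small, shrink the most.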
    14. Web querying and ranking
      • Keyword-based search is the dominant paradigm
        • No large-scale open-domain QA systems (yet)
      • Relevance
        • Vector space model and variants
      • Query expansion
      • Latent semantic indexing
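To make the vector space model on slide 14 concrete, the sketch below weights terms with one simple TF-IDF variant (raw term frequency times log(N/df)), scores documents by cosine similarity to the query, and keeps only the top K with a heap instead of sorting every score. The weighting choice, the function names, and the toy documents are assumptions for illustration; the slides only say "vector space model and variants".

```python
import heapq
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: {doc_id: list of terms}. Returns ({doc_id: {term: weight}}, idf)."""
    n = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in Counter(terms).items()}
        for doc_id, terms in docs.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query_terms, vectors, idf, k=2):
    """Return the k best (score, doc_id) pairs, highest score first."""
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_terms).items()}
    scored = ((cosine(q, vec), doc_id) for doc_id, vec in vectors.items())
    return heapq.nlargest(k, scored)


docs = {
    "d1": ["web", "crawling", "politeness"],
    "d2": ["web", "search", "ranking"],
    "d3": ["link", "ranking", "pagerank"],
}
vectors, idf = tfidf_vectors(docs)
results = search(["link", "ranking"], vectors, idf)
print(results)
```

`heapq.nlargest` keeps only K candidates in memory at a time, which is one simple way to avoid a full sort when only the first results page is needed.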
    15. Web ranking
      • Quality is the main problem
      • Link ranking
        • Hypothesis 1: Topical locality of links
        • Hypothesis 2: Link implies endorsement
      • PageRank
      • HITS
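PageRank, listed on slide 15, can be approximated by power iteration over the link graph. The version below is a minimal sketch, assuming a damping factor of 0.85 and spreading the rank of pages without out-links evenly over all pages; the function name and the four-page graph are illustrative, not taken from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}; scores sum to 1."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is shared equally by all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes its rank on in equal parts to the pages it links to.
        for p, targets in links.items():
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph: "c" is linked from three pages, "d" from none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

The score of a page depends on how many pages link to it and on how highly those pages are ranked themselves, which is how the "link implies endorsement" hypothesis enters the computation.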
    16. HITS
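HITS assigns each page two scores: an authority score (it is linked to by good hubs) and a hub score (it links to good authorities). The two are updated in turn until they stabilize. The sketch below is a minimal version under assumptions: a fixed number of iterations, Euclidean normalization, and a made-up graph in which `h1` to `h3` act as hubs and `a1`, `a2` as authorities.

```python
import math


def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (hub, authority) score dicts."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get))  # page with the highest authority score
print(max(hub, key=hub.get))    # one of the pages with the highest hub score
```

In the original formulation HITS runs on a query-specific subgraph rather than the whole Web; the sketch ignores how that subgraph is selected.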
    17. Rank manipulation
      • “The bubble of Web visibility”
      • Content spam
        • Keyword stuffing
        • Content hiding
      • Link spam
        • Link farms
      • Cloaking
    18. Web mining
    19. Content mining
      • Extraction of knowledge from Web pages
        • BUT ... HTML is physical formatting
        • There is information loss
    20. Information loss
    21. Aspects of content mining
      • Information extraction
        • Reverse the information loss
      • Content classification
        • Topic
        • Genre
      • Sentiment analysis
    22. Link mining
      • Scale-free networks
    23. Macroscopic view
      • Bow-tie structure
    24. Usage mining
      • Logfile analysis
      • Query logs
      • Privacy issues
    25. Emerging topics
      • Mobile Web
      • Semantic Web
      • ...

    Carlos Castillo, 2 years ago


    CC Attribution License
