Web Mining

A Complete Overview of Web Mining

Published in: Technology

  1. 1. Presented By: Akshat Saxena Anjul Sahu
  2. 2. Definition <ul><li>Application of data mining techniques on the web to discover interesting patterns. </li></ul>
  3. 3. Introduction <ul><li>Size of the web is extremely large </li></ul><ul><li>Data present on the web is unstructured </li></ul><ul><li>Good scope for data mining </li></ul><ul><li>Types of data on the web </li></ul><ul><ul><li>Content of the actual webpage </li></ul></ul><ul><ul><li>Intrapage structure </li></ul></ul><ul><ul><li>Interpage structure </li></ul></ul><ul><ul><li>Usage data </li></ul></ul><ul><ul><li>User profiles and cookies </li></ul></ul>
  4. 4. Web Mining Taxonomy
  5. 5. Web Content Mining <ul><li>Extends the work of search engines </li></ul><ul><li>Improves on the traditional crawler technique </li></ul><ul><li>Uses data mining for efficiency, effectiveness and scalability </li></ul><ul><li>Further divided into </li></ul><ul><ul><li>Agent-based approach </li></ul></ul><ul><ul><li>Database-based approach </li></ul></ul><ul><li>Text mining may or may not be counted as content mining </li></ul><ul><li>Crawlers </li></ul><ul><li>Personalization </li></ul>
  6. 6. Web Content Mining Subtasks <ul><li>Resource finding </li></ul><ul><ul><li>Retrieving intended documents </li></ul></ul><ul><li>Information selection/pre-processing </li></ul><ul><ul><li>Select and pre-process specific information from selected documents </li></ul></ul><ul><li>Generalization </li></ul><ul><ul><li>Discover general patterns within and across web sites </li></ul></ul><ul><li>Analysis </li></ul><ul><ul><li>Validation and/or interpretation of mined patterns </li></ul></ul>
  7. 7. Text Mining
  8. 8. Web Crawler <ul><li>Program which browses the WWW in a methodical, automated manner </li></ul><ul><li>Copies pages into a cache and indexes them </li></ul><ul><li>Starts from a seed URL </li></ul><ul><li>Searches pages and finds links and keywords </li></ul><ul><li>Types of crawler </li></ul><ul><ul><li>Context focused </li></ul></ul><ul><ul><li>Focused </li></ul></ul><ul><ul><li>Incremental </li></ul></ul><ul><ul><li>Periodic </li></ul></ul>
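The crawler steps on this slide (start from a seed URL, cache each page, index its keywords, follow its links) can be sketched as follows. This is a minimal illustration under stated assumptions: `TOY_WEB` is a hypothetical in-memory stand-in for the web, and a real crawler would fetch pages over HTTP and respect politeness rules.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.html": ("web mining overview", ["b.html", "c.html"]),
    "b.html": ("crawler basics", ["c.html", "d.html"]),
    "c.html": ("indexing pages", ["a.html"]),
    "d.html": ("usage logs", []),
}

def crawl(seed, web):
    """Breadth-first crawl from `seed`; returns (visit order, keyword index)."""
    cache = {}   # URL -> cached copy of the page text
    index = {}   # keyword -> set of URLs whose text contains it
    order = []
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        if url in cache or url not in web:
            continue  # already crawled, or a dead link
        text, links = web[url]
        cache[url] = text
        order.append(url)
        for word in text.split():
            index.setdefault(word, set()).add(url)
        frontier.extend(links)  # newly found links join the queue
    return order, index

order, index = crawl("a.html", TOY_WEB)
print(order)
print(sorted(index["indexing"]))
```

The `cache` dictionary doubles as the visited set, so a page reached through several links is only processed once.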
  9. 9. Focused Crawler
  10. 10. Focused Crawler <ul><li>Visits only pages of interest </li></ul><ul><li>Architecture consists of: </li></ul><ul><ul><li>Hyperlink Classifier </li></ul></ul><ul><ul><li>Distiller </li></ul></ul><ul><ul><li>Crawler </li></ul></ul><ul><li>Hub pages - pages that link to many relevant pages </li></ul><ul><li>Hard focus - links are followed only when the parent page is classified as relevant </li></ul><ul><li>Soft focus - links are prioritized by the parent page's probability of relevance </li></ul><ul><li>Harvest rate – fraction of crawled pages that are relevant (precision) </li></ul>
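The difference between hard focus, soft focus and harvest rate can be illustrated with a small best-first crawl. This is a sketch under stated assumptions, not the architecture from the slide: the keyword-fraction `relevance` function stands in for a trained classifier, and `PAGES`, `TOPIC_WORDS` and the 0.5 threshold are hypothetical.

```python
import heapq

TOPIC_WORDS = {"mining", "crawler", "web"}  # hypothetical topic vocabulary

# Hypothetical pages: URL -> (text, outgoing links).
PAGES = {
    "hub":    ("web mining links", ["m1", "m2", "sports"]),
    "m1":     ("web mining crawler", ["m2"]),
    "m2":     ("mining web usage", []),
    "sports": ("football scores today", ["m3"]),
    "m3":     ("web crawler mining", []),
}

def relevance(text):
    """Toy classifier: fraction of words that belong to the topic vocabulary."""
    words = text.split()
    return sum(w in TOPIC_WORDS for w in words) / len(words)

def focused_crawl(seed, pages, hard_focus=False, threshold=0.5):
    """Best-first crawl; returns (visit order, harvest rate)."""
    visited, order = set(), []
    frontier = [(-1.0, seed)]  # min-heap on negated priority = max-heap
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        text, links = pages[url]
        score = relevance(text)
        if hard_focus and score < threshold:
            continue  # hard focus: never expand an irrelevant parent
        for link in links:
            if link not in visited:
                # soft focus: a link inherits its parent's relevance as priority
                heapq.heappush(frontier, (-score, link))
    relevant = sum(relevance(pages[u][0]) >= threshold for u in order)
    return order, relevant / len(order)

print(focused_crawl("hub", PAGES))
print(focused_crawl("hub", PAGES, hard_focus=True))
```

In this toy graph the hard-focus run never reaches `m3`, because its only parent (`sports`) is irrelevant, while the soft-focus run visits it last.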
  11. 11. Context Focused Crawler <ul><li>Focused crawler was static </li></ul><ul><li>Drawbacks: </li></ul><ul><ul><li>Non-relevant pages may link to relevant ones, so their links should still be followed </li></ul></ul><ul><ul><li>Relevant pages may not link to other relevant ones, which calls for backward crawling </li></ul></ul><ul><li>CFC works in two steps </li></ul><ul><ul><li>Construct context graphs and classifiers </li></ul></ul><ul><ul><li>Crawl using these classifiers </li></ul></ul>
  12. 12. Harvest System <ul><li>Uses caching, indexing and crawling </li></ul><ul><li>Acts as a tool for gathering information from other sources </li></ul><ul><li>Components: </li></ul><ul><ul><li>Gatherer - obtains information </li></ul></ul><ul><ul><li>Broker - provides index and query interface </li></ul></ul><ul><li>Essence systems </li></ul><ul><li>Semantic indexing </li></ul>
  13. 13. Virtual Web View <ul><li>Web viewed as a multiple layered database (MLDB) </li></ul><ul><li>A view of the MLDB is a virtual web view </li></ul><ul><li>No spiders used </li></ul><ul><li>Websites send their indices to others </li></ul><ul><li>WebML – DMQL for web mining </li></ul><ul><li>Keywords – covers, covered by, like, close to </li></ul><ul><li>Difficult to implement </li></ul>
  14. 14. Personalization <ul><li>Contents of web are modified as per user’s desires </li></ul><ul><li>Personalized not targeted </li></ul><ul><li>Use cookies, userID, profile information </li></ul><ul><li>Legal issues to be considered </li></ul><ul><li>Includes clustering, classification or even prediction </li></ul>
  15. 15. Personalization <ul><li>Types: </li></ul><ul><ul><li>User preference </li></ul></ul><ul><ul><li>Collaborative filtering </li></ul></ul><ul><ul><li>Content based filtering </li></ul></ul><ul><li>Example : My Yahoo! was first. Now almost every service offers personalization. </li></ul>
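Collaborative filtering, one of the personalization types listed on this slide, recommends items that similar users liked. The sketch below is only an illustration under stated assumptions: the `PROFILES` data is hypothetical, and Jaccard similarity with a single nearest neighbour is one simple choice among many.

```python
# Hypothetical user profiles: user -> set of portal sections they visit.
PROFILES = {
    "alice": {"news", "sports", "finance"},
    "bob":   {"news", "sports", "weather"},
    "carol": {"music", "movies"},
}

def jaccard(a, b):
    """Set overlap in [0, 1]: shared items divided by all items."""
    return len(a & b) / len(a | b)

def recommend(target, profiles):
    """Suggest items that the most similar other user has and `target` lacks."""
    others = {u: p for u, p in profiles.items() if u != target}
    nearest = max(others, key=lambda u: jaccard(profiles[target], others[u]))
    return sorted(others[nearest] - profiles[target])

print(recommend("alice", PROFILES))
```

Content-based filtering would instead compare the attributes of the items themselves with the user's own history, without looking at other users.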
  16. 16. Personalization <ul><li>Yahoo was the first to introduce the concept of a ’personalized portal’, i.e. a Web site designed to have the look-and-feel as well as content personalized to the needs of an individual end-user. </li></ul><ul><li>Mining MyYahoo usage logs provides Yahoo valuable insight into an individual’s Web usage habits, enabling Yahoo to provide compelling personalized content, which in turn has led to the tremendous popularity of the Yahoo Web site. </li></ul>
  17. 17. Web Structure Mining <ul><li>Creating a model of web organization </li></ul><ul><li>Classify web pages </li></ul><ul><li>Create similarity measures between web pages </li></ul><ul><li>PageRank </li></ul><ul><li>The Clever system </li></ul><ul><li>Hyperlink-Induced Topic Search (HITS) </li></ul>
  18. 18. PageRank™ <ul><li>Link analysis algorithm which assigns a numerical weight to a webpage. </li></ul><ul><li>The numerical weight that it assigns to any given element E is also called the PageRank of E and is denoted by PR(E). </li></ul><ul><li>The PageRank value for a page u depends on the PageRank value of each page v in the set B_u (the set of all pages linking to page u), divided by the number L(v) of links from page v. </li></ul>
  19. 19. Page Rank <ul><li>Increase effectiveness of search engines </li></ul><ul><li>Based on number of back links </li></ul><ul><li>Rank sink problem exists </li></ul>
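The definition on the PageRank slides sums PR(v) / L(v) over every page v that links to u. A minimal iterative sketch of that computation follows. Two details are assumptions not stated on the slides: the damping factor `d = 0.85`, which is the usual way to mitigate the rank sink problem, and the uniform redistribution of rank from pages with no outgoing links. The `LINKS` graph is hypothetical.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PR(u) = (1 - d)/N + d * sum(PR(v)/L(v) for v in B_u).

    `links` maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(pr[p] for p in pages if not links[p])
        new = {}
        for u in pages:
            incoming = sum(pr[v] / len(links[v]) for v in pages if u in links[v])
            new[u] = (1 - d) / n + d * (incoming + dangling / n)
        pr = new
    return pr

# Hypothetical three-page graph: A -> B, A -> C, B -> C, C -> A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
print({page: round(score, 3) for page, score in ranks.items()})
```

Page C ends up with the highest rank because it receives links from both other pages, and A comes second because it receives all of C's rank.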
  20. 20. Clever System <ul><li>Finds both authoritative pages and hubs </li></ul><ul><li>Authoritative page - best source on a topic </li></ul><ul><li>Hub - page that links to authoritative pages </li></ul><ul><li>Most valuable pages are returned </li></ul><ul><li>Hyperlink-Induced Topic Search </li></ul><ul><ul><li>Keywords </li></ul></ul><ul><ul><li>Authority and hub measure </li></ul></ul>
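The authority and hub measures can be computed by mutual reinforcement: a page's authority is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The sketch below shows that iteration only; the keyword-based selection of the starting page set is omitted, and `GRAPH` is a hypothetical example.

```python
import math

def hits(links, iterations=20):
    """Return (hub, authority) scores for a graph given as page -> linked pages."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages that link to p.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {p: a / norm for p, a in auth.items()}
        # Hub update: sum of authority scores of pages that p links to.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {p: h / norm for p, h in hub.items()}
    return hub, auth

# Hypothetical graph: h1 and h2 are link collections, a1 and a2 are sources.
GRAPH = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(GRAPH)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

Scores are normalized to unit length after each update so they stay bounded across iterations.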
  21. 21. Alternatives to PageRank <ul><li>HITS Algorithm </li></ul><ul><li>IBM Clever Project </li></ul><ul><li>TrustRank </li></ul><ul><li>But PageRank is the most popular of these algorithms and is widely used by search engines </li></ul>
  22. 22. Web Usage Mining <ul><li>Applies mining on web usage data or weblogs or clickstream data </li></ul><ul><li>Client perspective </li></ul><ul><li>Server perspective </li></ul><ul><li>Aid in personalization </li></ul><ul><li>Helps in evaluating quality and effectiveness </li></ul><ul><li>Preprocessing, pattern discovery and data structures </li></ul>
  23. 23. Trackers for site usage and analysis
  24. 25. Issues in Web Log <ul><li>Identify exact user </li></ul><ul><li>Exact sequence of pages visited </li></ul><ul><li>Security, privacy and legal issues </li></ul>
  25. 26. Preprocessing <ul><li>Information not in presentable format </li></ul><ul><li>Data cleaning required </li></ul><ul><li>Log: (<src id>,<literal>,<timestamp>) </li></ul><ul><li>Data might be grouped </li></ul><ul><li>Sessions </li></ul><ul><li>Path completion </li></ul>
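The cleaning, grouping and session steps on this slide can be sketched as below. Several details are assumptions for illustration: log entries are read as (source id, requested page, timestamp in seconds), image and stylesheet requests count as noise, and a gap of more than 30 minutes starts a new session. Path completion is not shown.

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # assumed inactivity timeout, in seconds

# Hypothetical log entries: (source id, requested page, timestamp).
LOG = [
    ("u1", "/home", 0),
    ("u1", "/logo.gif", 1),
    ("u2", "/home", 10),
    ("u1", "/products", 60),
    ("u1", "/home", 4000),
    ("u2", "/cart", 100),
]

def sessionize(log, gap=SESSION_GAP, noise_suffixes=(".gif", ".jpg", ".css")):
    """Clean the log, group requests by user, and split them into sessions."""
    by_user = defaultdict(list)
    for user, page, ts in sorted(log, key=lambda entry: entry[2]):
        if not page.endswith(noise_suffixes):  # data cleaning
            by_user[user].append((page, ts))
    sessions = []
    for user, requests in by_user.items():
        current = [requests[0][0]]
        for (_, prev_ts), (page, ts) in zip(requests, requests[1:]):
            if ts - prev_ts > gap:  # long pause: close the current session
                sessions.append((user, current))
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions

for user, pages in sessionize(LOG):
    print(user, pages)
```

Grouping by source id only approximates individual users, which is the identification issue raised on the preceding slide.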
  26. 27. Data Structure <ul><li>A data structure is needed to keep track of the patterns identified </li></ul><ul><li>The data structure used is a trie </li></ul><ul><li>A rooted tree in which each path from the root to a node represents a sequence </li></ul>
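A trie over page sequences can be sketched as follows. The class and method names (`TrieNode`, `SequenceTrie`, `prefix_count`) are hypothetical, and storing a visit count on each node is one possible way to track how often a sequence prefix occurs.

```python
class TrieNode:
    """One node of the trie; the path from the root to it spells a sequence."""
    def __init__(self):
        self.children = {}  # page -> child TrieNode
        self.count = 0      # number of inserted sequences passing through

class SequenceTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, sequence):
        """Add one page sequence, incrementing counts along its path."""
        node = self.root
        for page in sequence:
            node = node.children.setdefault(page, TrieNode())
            node.count += 1

    def prefix_count(self, prefix):
        """Return how many inserted sequences start with `prefix`."""
        node = self.root
        for page in prefix:
            node = node.children.get(page)
            if node is None:
                return 0
        return node.count

trie = SequenceTrie()
for session in (["A", "B", "C"], ["A", "B", "D"], ["A", "C"]):
    trie.insert(session)
print(trie.prefix_count(["A", "B"]))
```

Sequences that share a prefix share the same nodes, so common beginnings of sessions are stored and counted only once.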
  27. 28. Pattern Discovery <ul><li>Traversal pattern - pages visited in a session </li></ul><ul><li>Properties: </li></ul><ul><ul><li>Duplicate references may or may not be allowed </li></ul></ul><ul><ul><li>Consists of only contiguous page references </li></ul></ul><ul><ul><li>Pattern may or may not be maximal </li></ul></ul><ul><li>Association rules - pages accessed together </li></ul>
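Association rules over pages accessed together are usually judged by support (the share of sessions containing the pages) and confidence (how often the consequent appears given the antecedent). The sketch below only considers page pairs, and the `SESSIONS` data and both thresholds are hypothetical.

```python
from itertools import combinations

# Hypothetical sessions, each reduced to the set of pages it touched.
SESSIONS = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "blog"},
    {"products", "cart"},
]

def support(itemset, sessions):
    """Fraction of sessions that contain every page in `itemset`."""
    return sum(itemset <= s for s in sessions) / len(sessions)

def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return rules (x, y, support, confidence) meaning 'x implies y'."""
    pages = sorted(set().union(*sessions))
    rules = []
    for a, b in combinations(pages, 2):
        s = support({a, b}, sessions)
        if s < min_support:
            continue  # the pair is not frequent enough
        for x, y in ((a, b), (b, a)):
            confidence = s / support({x}, sessions)
            if confidence >= min_confidence:
                rules.append((x, y, s, confidence))
    return rules

for x, y, s, c in pair_rules(SESSIONS):
    print(f"{x} -> {y}  support={s:.2f} confidence={c:.2f}")
```

Unlike traversal patterns, these rules ignore the order in which the pages were visited.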
  28. 29. Pattern Discovery <ul><li>Sequential pattern - ordered set of pages that satisfies a minimum support and is maximal </li></ul><ul><li>Found with an algorithm similar to Apriori </li></ul><ul><li>Web access pattern - efficient counting </li></ul><ul><li>Episodes – partially ordered by access time; users not identified </li></ul><ul><li>Pattern analysis </li></ul>
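The notions of support and maximality for ordered patterns can be illustrated with a brute-force sketch. It enumerates every contiguous subsequence of each session instead of generating candidates level by level as an Apriori-style algorithm would, so it is only suitable for tiny inputs. The `SESSIONS` data and the 0.5 threshold are hypothetical.

```python
from collections import Counter

SESSIONS = [["A", "B", "C"], ["A", "B", "D"], ["B", "C"], ["A", "B", "C", "D"]]

def contiguous_subsequences(session):
    """All distinct contiguous runs of pages in one session, as tuples."""
    n = len(session)
    return {tuple(session[i:j]) for i in range(n) for j in range(i + 1, n + 1)}

def frequent_patterns(sessions, min_support):
    """Map each pattern to its support if that support is at least `min_support`."""
    counts = Counter()
    for session in sessions:
        counts.update(contiguous_subsequences(session))  # once per session
    n = len(sessions)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

def is_subpattern(small, big):
    """True if `small` occurs as a contiguous run inside `big`."""
    k = len(small)
    return any(big[i:i + k] == small for i in range(len(big) - k + 1))

def maximal(patterns):
    """Keep only patterns not contained in another frequent pattern."""
    return sorted(p for p in patterns
                  if not any(p != q and is_subpattern(p, q) for q in patterns))

frequent = frequent_patterns(SESSIONS, min_support=0.5)
print(maximal(frequent))
```

With these sessions, the frequent patterns collapse to two maximal ones, because every other frequent pattern is contained in A, B, C.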
  29. 30. Queries ‘N Suggestions <ul><li>References: </li></ul><ul><ul><li>http://maya.cs.depaul.edu/~mobasher/webminer/survey/ </li></ul></ul><ul><ul><li>Google.com/Technology </li></ul></ul><ul><ul><li>http://www.almaden.ibm.com/projects/clever.shtml </li></ul></ul><ul><ul><li>Thanks !!  </li></ul></ul><ul><ul><li>{akshatsaxena11, anjulsahu}@gmail.com </li></ul></ul>