Discovering knowledge using web structure mining
Published in: Education, Technology
Transcript

  • 1. 1. What is the Web?
  • 2. 1.1 Problems With Web
      • Difficulty in finding relevant information
      • Personalization of information
      • Learning about consumers or individual users
  • 3. 2. Objectives
      i. To survey the area of web mining.
      ii. To introduce link mining.
      iii. To review the HITS and PageRank algorithms.
  • 4. 3. Web Mining: Definition
      • Process of discovering potentially useful and previously unknown information or knowledge from web data.
  • 5. 3.1 Web Mining: Subtasks
      • Resource finding
      • Information selection and pre-processing
      • Generalization
      • Analysis
  • 6. 3.1 Web Mining Categories
      • Web Content Mining: text and multimedia documents
      • Web Structure Mining: hyperlink structure
      • Web Usage Mining: web log records
  • 7. 3.1.1 Web Content Mining
      • Scanning the data of a Web page to determine content relevance with respect to a search query.
      • Two approaches:
          • Agent-based approach
          • Database approach
  • 8. 3.1.2 Web Structure Mining
      • Identifies relationships between Web pages.
      • Focuses on the following problems:
          • Reducing irrelevant search results.
          • Helping to index information on the web.
  • 9. 3.1.3 Web Usage Mining
      • Focuses on techniques that predict user behavior while interacting with the WWW.
      • Web log records are analyzed to discover user access patterns.
      • The challenges can be divided into three phases:
          • Pre-processing
          • Pattern discovery
          • Pattern analysis
  • 10. 4. Link Mining
      • It is located at the intersection of the work in:
          • Link analysis
          • Hypertext and web mining
          • Relational learning and inductive logic programming
          • Graph mining
      • Some tasks of link mining applicable in web structure mining are:
          • Link-based classification
          • Link-based cluster analysis
          • Link type
          • Link strength
          • Link cardinality
  • 11. (i) Link-based Classification
      • Predicts the category of a web page, based on:
          • words that occur on the page
          • links between pages
          • anchor text
          • HTML tags
          • and other possible attributes of the web page.
      • E.g. predicting the category of a paper, based on its citations and co-citations.
  • 12. (ii) Link-based Cluster Analysis
      • Goal: finding naturally occurring subclasses.
      • Data is segmented into groups:
          • similar objects are grouped together
          • dissimilar objects go into different groups.
      • Helps in discovering hidden patterns.
      • E.g. finding diseases with similar transmission patterns.
  • 13. (iii) Link Type
      • Predicting the link type between two entities.
      • Predicting the purpose of a link.
      • E.g. navigational or advertising.
  • 14. (iv) Link Strength
      • Links can be associated with weights.
          • Strong links: higher weight
          • Weak links: lower weight
  • 15. (v) Link Cardinality
      • Refers to the number of inbound links to a web site.
      • Link popularity: a combination of factors that weigh the importance of each incoming link.
  • 16. 5. Hyperlink-Induced Topic Search (HITS)
      • Link analysis algorithm that rates pages.
      • Identifies two kinds of pages from the Web hyperlink structure:
          • Authorities: contain valuable information on the subject.
          • Hubs: contain useful links towards the authoritative pages.
      • [Figure: hubs are web pages with links to other pages; authorities are web pages with content.]
  • 17. HITS Contd…
      • Two-step process:
          • Sampling step: a set of relevant pages is collected.
          • Iterative step: hubs and authorities are found using the output of the sampling step.
  • 18. HITS Contd…
      • Sampling step:
          • A query submitted to a search engine yields a root set.
          • From the root set we expand to a base set.
      • [Figure: Expanding the root set into base set]
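The sampling step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `expand_to_base_set`, the dictionaries `out_links` and `in_links`, and the cap `max_in_links` are hypothetical and do not come from the slides. The idea shown is that the base set is the root set plus the pages it links to and (a capped number of) the pages that link to it.

```python
def expand_to_base_set(root_set, out_links, in_links, max_in_links=50):
    """Grow a root set into a base set by following links in both directions.

    out_links / in_links map a page to the pages it points to / is pointed to by.
    max_in_links is an assumed cap so that very popular pages do not flood the set.
    """
    base = set(root_set)
    for page in root_set:
        # Add every page that the root page points to.
        base.update(out_links.get(page, ()))
        # Add a bounded number of pages that point to the root page.
        base.update(sorted(in_links.get(page, ()))[:max_in_links])
    return base


# Toy link structure (hypothetical page names).
out_links = {"a": ["b", "c"], "b": ["c"], "d": ["a"]}
in_links = {"a": ["d"], "b": ["a"], "c": ["a", "b"]}

base = expand_to_base_set({"a"}, out_links, in_links)
print(sorted(base))
```

Starting from the single root page `a`, the base set picks up `b` and `c` (pages `a` links to) and `d` (a page linking to `a`).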
  • 19. HITS Contd…
      • Iterative step:
          • Associate a non-negative authority weight x<p> and a non-negative hub weight y<p> with each page p.
      • [Figures: Computing Authority Weight; Computing Hub Weight]
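A minimal Python sketch of the iterative step follows. It assumes the usual HITS update rules: the authority weight of a page is the sum of the hub weights of the pages linking to it, and the hub weight of a page is the sum of the authority weights of the pages it links to, with both vectors normalized after each round. The names `hits`, `normalize`, and `links`, and the fixed iteration count, are illustrative choices, not taken from the slides.

```python
import math


def normalize(weights):
    """Scale weights so that their squares sum to one."""
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {page: v / norm for page, v in weights.items()}


def hits(links, iterations=20):
    """Return (authority, hub) weights for a graph given as page -> linked pages."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub weights of all pages that link to a page.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub update: sum the authority weights of all pages a page links to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Toy base set (hypothetical pages): h1 and h2 act as hubs, a1 and a2 as authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1"]}
auth, hub = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph `a1` ends up as the strongest authority because both hubs point to it, and `h1` as the strongest hub because it points to both authorities.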
  • 20. Problems With HITS Algorithm
      • Some problems with the HITS algorithm are:
          • Mutually reinforced relationships between hosts
          • Automatically generated links
          • Non-relevant nodes
          • Hubs and authorities
          • Topic drift
          • Efficiency
  • 21. 6. PageRank Model
      • It is a link analysis algorithm.
      • Assigns a numeric value that measures the importance of a web page.
      • Computes importance from the number of incoming links and the rank of the pages they come from.
  • 22. PageRank Contd…
      • The rank of a page is divided evenly among its out-links to contribute to the ranks of the pages they point to.
      • PageRanks form a probability distribution over web pages, so the sum of all pages' PageRanks will be one.
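  • 23. PageRank Contd…
      • PageRank can be calculated by: PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
      • T1..Tn are the pages that point to page A.
      • C(A) is defined as the number of links going out of page A.
      • d is the damping factor, which is usually set to 0.85.
      • The damping factor is the probability that a random surfer keeps following links from the current page; with probability 1 - d the surfer gets bored and requests another random page.
      • In this form the ranks sum to the number of pages N when every page has at least one out-link; using (1 - d)/N as the first term makes them sum to one.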
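The formula above can be evaluated by simple repeated substitution. The sketch below is a minimal illustration, assuming a small in-memory graph; the names `pagerank` and `links`, the starting value of 1.0, and the iteration count are illustrative choices, and pages without out-links simply pass on no rank here.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it points to.
    """
    pages = set(links) | {a for targets in links.values() for a in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the (1 - d) term.
        new_pr = {p: 1.0 - d for p in pages}
        for t, targets in links.items():
            if not targets:
                continue  # a page with no out-links contributes nothing in this sketch
            # The rank of t is divided evenly among its C(t) out-links.
            share = pr[t] / len(targets)
            for a in targets:
                new_pr[a] += d * share
        pr = new_pr
    return pr


# Toy graph (hypothetical pages): A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph `C` receives links from both other pages and ranks highest, while `B`, which receives only half of `A`'s rank, ranks lowest. Because the (1 - d) term is not divided by the number of pages, the ranks here sum to 3 rather than to one.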
  • 24. Applications
      • HITS was used in the Clever search engine by IBM.
      • PageRank is used by Google.
  • 25. References
      • Knowledge Discovery and Retrieval on World Wide Web Using Web Structure Mining by Sekhar Babu Boddu, V.P Krishna Anne, Rajesekhara Rao Kurra and Durgesh Kumar Mishra, 2010, In proceedings of Fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation (AMS), IEEE
      • Link Mining: A New Data Mining Challenge by Lise Getoor, 2003, SIGKDD Explorations, Volume 4, Issue 2
      • Authoritative Sources in a Hyperlinked Environment by Jon M. Kleinberg, 1998, In proceedings of ACM-SIAM Symposium on Discrete Algorithms
      • The PageRank Citation Ranking: Bringing Order to the Web by L. Page, S. Brin and T. Winograd, 1998, Technical report, Stanford University
      • wikipedia.org
      • web-datamining.net
      • maya.cs.depaul.edu
