Your SlideShare is downloading. ×
Data.Mining.C.8(Ii).Web Mining 570802461
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Introducing the official SlideShare app

Stunning, full-screen experience for iPhone and Android

Text the download link to your phone

Standard text messaging rates apply

Data.Mining.C.8(Ii).Web Mining 570802461

3,431
views

Published on

Published in: Technology

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,431
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
83
Comments
0
Likes
1
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1.
    • Chapter 8.
    • Mining Complex Types of Dat a (II)
    • --Web Mining--
  • 2. Chapter 8. Mining Complex Types of Data (II)
    • Introduction to Web mining
    • Web Structure Analysis
    • PageRank
    • HITS Approach
    • Summary
  • 3. Mining the World-Wide Web
    • The WWW is huge, widely distributed, global information service centre for
      • Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
      • Rich and dynamic Hyper-link( 超连接 ) information
      • Access and usage information (WEB 页面的访问和使用信息 )
    • WWW provides rich sources for data/text mining
    • Challenges
      • Too huge for effective data/text warehousing and mining
      • Too complex and heterogeneous: no standards and structure
  • 4. Web Mining: A challenging task
    • Researches for
      • Web access patterns ( 访问模式 )
      • Web structures and regularity
      • Web contents
    • Problems
      • The “ abundance ” problem
      • Limited coverage of the Web: hidden Web sources, majority of data in DBMS
      • Limited query interface based on keyword-oriented search
      • Etc.
  • 5. Web Mining Taxonomy Web Mining Web Structure Mining Web Content Mining Web Page Content Mining Search Result Mining Web Usage Mining General Access Pattern Tracking Customized Usage Tracking
  • 6. Mining the World-Wide Web Web Structure Mining Web Content Mining
    • Web Page Content Mining
    • WebLog ( Lakshmanan et.al. 1996 ) , WebOQL ( Mendelzon et.al. 1998 ) …:
    • Can identify information within given web pages
    • Ahoy! ( Etzioni et.al. 1997 ):Uses heuristics to distinguish personal home pages from other web pages
    • ShopBot ( Etzioni et.al. 1997 ) : Looks for product prices within web pages
    Search Result Mining Web Usage Mining General Access Pattern Tracking Customized Usage Tracking Web Mining
  • 7. Mining the World-Wide Web Web Usage Mining General Access Pattern Tracking Customized Usage Tracking Web Structure Mining Web Content Mining Web Page Content Mining
    • Search Result Mining
    • Search Engine Result Summarization
    • Clustering Search Result ( Leouski and Croft, 1996, Zamir and Etzioni, 1997 ):
    • Categorizes documents using phrases in titles
    Web Mining
  • 8. Mining the World-Wide Web Web Structure Mining Web Content Mining Web Page Content Mining Search Result Mining Web Usage Mining
    • General Access Pattern Tracking
    • Web Log Mining ( Zaïane, Xin and Han, 1998 )
    • Uses DM techniques to understand general access patterns and trends.
    • Can shed light on better structure and grouping of resource providers.
    Customized Usage Tracking Web Mining
  • 9. Mining the World-Wide Web Web Usage Mining General Access Pattern Tracking
    • Customized Usage Tracking
    • Adaptive Sites ( Perkowitz and Etzioni, 1997 )
    • Analyse access patterns of each user at a time.
    • Web site restructures itself automatically by learning from user access patterns.
    Web Structure Mining Web Content Mining Web Page Content Mining Search Result Mining Web Mining
  • 10. Mining the World-Wide Web Web Content Mining Web Page Content Mining Search Result Mining Web Usage Mining General Access Pattern Tracking Customized Usage Tracking
    • Web Structure Mining
    • Using Links
    • HITS ( Kleinberg, 1998 )
    • PageRank ( Sergey Brin and Larry Page ,1998)
    • Amount of Web linkage information provides rich information about the relevance, the quality and structure of the Web’s content
    • Use interconnections between web pages to give weight to pages.
    • .
    Web Mining
  • 11. Chapter 8. Mining Complex Types of Data (II)
    • Introduction to Web mining
    • Web Structure Analysis
    • PageRank
    • HITS Approach
    • Summary
  • 12. Introduction
    • Early search engines mainly compare the similarity of the query and the indexed pages. i .e.,
      • They use information retrieval methods, cosine , ...
    • From 1996, it became clear that the similarity alone was no longer sufficient .
      • The number of pages grew rapidly in the mid-late 1990’s.
        • Google estimates: 10 million relevant pages.
        • How to choose only 30-40 pages and rank them suitably to present to the user?.
  • 13. Web Structure Analysis
    • Starting around 1996, researchers began to work on the problem. They resort to hyperlinks (超连接) .
    • Web pages on the other hand are connected through hyperlinks, which carry important information.
      • Some hyperlinks : organize information at the same site.
      • Other hyperlinks : point to pages from other Web sites. Such out-going hyperlinks often indicate an implicit conveyance of authority (权威)
      • to the pages being pointed to.
    • Those pages that are pointed to by many other pages are likely to contain authoritative information.
  • 14. Web Structure Analysis
    • During 1997-1998, two most influential hyperlink - based search algorithms PageRank and HITS were reported.
    • Both algorithms exploit the hyperlinks of the Web to rank pages
      • PageRank : Sergey Brin and Larry Page, PhD students from Stanford University, at Seventh International World Wide Web Conference ( WWW ) in April, 1998.
      • HITS : Jon Kleinberg (Cornell University), at Ninth Annual ACM-SIAM Symposium on Discrete Algorithms , January 1998
  • 15. Chapter 8. Mining Complex Types of Data (II)
    • Introduction Web mining
    • Web Structure Analysis
    • PageRank
    • HITS Approach
    • Summary
  • 16. Background: Social Network Analysis
    • Social network : the study of social entities (people in an organization )
    • - actors ( 主体 ) , their interactions / relationships .
    • Interactions / relationships : represented by network or graph ,
      • each vertex (or node) : an actor
      • each link : a relationship.
    • From the network, we can study
    • - properties of its structure
    • - actor: the role, position and prestige ( 声望 )
    • C ommunities : various kinds of sub-graphs , formed by groups of actors.
  • 17. Social Network and the Web
    • Web : viewed as a virtual social network
      • Each page: a ctor
      • each hyperlink: relationship.
    • R esults from social network can be adapted and extended for use in the Web context.
    • T wo types of social network analysis,
    • - centrality and prestige
    • closely related to hyperlink analysis and search on the Web.
  • 18. Centrality
    • A n actor with extensive contacts (links) or communications with many other actors in the organization is considered more important than a n actor with relatively fewer contacts.
    • Ce ntral actor : one involved in many link s.
  • 19. Measure of Centrality
    • Network: viewed as a directed graph
    • In-links of actor i : links pointing to i
    • Out-links of actor i : links pointing out from i
    • The simple degree centrality of actor i :
    • C( i ) = d out ( i )/( n-1 )
    • where d out ( i ) the number of out-links of actor i and
    • n the total number of actors in the network
    • Dividing n-1 standardizes the centrality value into range [0,1]
  • 20. Prestige
    • Prestige : more refined measure of prominence of an actor
    • than centrality.
    • P restigious actor :
    • one of extensive ties as a recipient use d only in-links.
    • Difference between centrality and prestige:
      • centrality focuses on out-links
      • prestige focuses on in-links
  • 21. Measure of P restige
    • In-links of actor i : links pointing to i
    • The simple degree Prestige of actor i :
    • P( i ) = d in ( i )/( n-1 )
    • where d in ( i ) the number of in-links of actor i and
    • n the total number of actors in the network
  • 22. Rank P restige
    • Rank prestige forms the basis of most Web page link analysis algorithms for PageRank .
    • In the real world, a person i chosen by an important person is more prestigious than chosen by a less important person.
      • For example, if a company CEO votes for a person is much more important than a worker votes for the person.
    • If one ’ s circle of influence is full of prestigious actors, then one ’ s own prestige is also high.
      • Thus one ’ s prestige is affected by the ranks or statuses of the involved actors.
  • 23. Measure of Rank P restige
    • R ank prestige P R ank ( i ) : a linear combination of links that point to i :
    • P R ank ( i ) = A 1i P R ank ( 1 ) + A 2i P R ank ( 2 ) + … + A ni P R ank ( n )
    • where A ji =1 if j points to i and 0 otherwise.
    • We have n equations for n actors --- mathematically we can write them as the column vector P :
    • A: the adjacency matrix of network (graph), where A ij =1 if i points to j and 0 otherwise
  • 24. Intuition Idea for Rank Prestige
    • A hyperlink from a page to another page is an implicit conveyance of authority to the target page.
      • The more in-links that a page i receives, the more prestige the page i has.
    • Pages that point to page i also have their own prestige scores.
      • A page of a higher prestige pointing to i is more important than a page of a lower prestige pointing to i .
      • In other words, a page is important if it is pointed to by other important pages.
    • This is exactly the idea of rank prestige in social network.
  • 25. PageRank Algorithm
    • According to rank prestige , the importance of page i ( i ’ s PageRank score ) is the sum of the PageRank scores of all pages that point to i .
    • The Web as a directed graph G = ( V , E ). Let the total number of pages be n . The PageRank score of the page i (denoted by P ( i )) is defined by:
    O j is the number of out-link of j
  • 26. Matrix Notation
    • Let P be a n -dimensional column vector of PageRank values, i.e., P = ( P (1), P (2), … , P ( n )) T .
    • Let A be the adjacency matrix of our graph with
    • Here we use O i to denote the number of out-links of a node i .
    • Each transition probability is 1/ O i if we assume the Web surfer will click the hyperlinks in the page i uniformly at random.
  • 27. Transition Probability Matrix
    • Let A be the state transition probability matrix
    • A ij : the transition probability that the surfer in state i (page i ) will move to state j (page j ).
  • 28. Let us start …
    • Given an initial probability distribution vector that a surfer is at each state (or page)
      • p 0 = ( p 0 (1), p 0 (2), … , p 0 ( n )) T (a column vector) and
      • an n  n transition probability matrix A ,
      • we have
  • 29. Random Surfer
    • State transition:
    • Where A ij (1) is the probability of going from i to j after 1 transition, we can write
    • In general, the probability distribution after k steps/transition:
  • 30. An Example Web Hyperlink Graph
  • 31. Improved PageRank
    • At a page, the random surfer has two options
      • With probability d , he randomly chooses an out-link to follow.
      • With probability 1- d , he jumps to a random page
    • I mproved model :
    • where E is ee T ( e is a column vector of all 1 ’ s) and thus E is a n  n square matrix of all 1 ’ s.
  • 32. Follow the Above Example
  • 33. Final PageRank Algorithm
    • PageRank for each page i is
  • 34. Final PageRank Algorithm
    • equivalent to the formula given in the PageRank algorithm
    • The parameter d is called the damping factor which can be set to between 0 and 1. d = 0.85 was used in the PageRank agorithm .
  • 35. Compute PageRank
    • Use the iteration method
    • PageRank-Iterate (G)
    • ;
    • k=1;
    • repeat
    • ;
    • k=k+1;
    • until ;
    • return
  • 36. Advantages of PageRank
    • PageRank is a global measure and query independent .
      • PageRank values of all the pages are computed and saved off-line rather than at the query time.
    • Criticism : Query-independence. It could not distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.
    • Nie, et al. Topical Link Analysis for Web Search, SIGIR 2006
  • 37. Chapter 8. Mining Complex Types of Data (II)
    • Introduction to Web mining
    • Web Structure Analysis
    • PageRank
    • HITS Approach
    • Summary
  • 38. Another Aim: Web Structure Analysis
    • H yperlinks are also useful for finding Web communities .
      • A Web community is a cluster of densely linked pages representing a group of people with a special interest.
    • Beyond explicit hyperlinks on the Web, links in other contexts are useful too , e.g.,
      • for discovering communities of named entities (e.g., people and organizations)
      • for analyzing social phenomena in emails.
  • 39. Background: Co-citation and Bibliographic Coupling
    • An typical area of research concerned with links is citation analysis ( 引证分析 ) of scholarly publications.
      • A scholarly publication cites related prior work to acknowledge the origins of some ideas and to compare the new proposal with existing work.
    • When a paper cites another paper, a relationship is established between the publications.
    • We discuss two types of citation analysis, co-citation ( 共引证 ) and bibliographic coupling ( 文献联结 ) . The HITS algorithm is related to these two types of analysis.
  • 40. Co-citation
    • If papers i and j are both cited by paper k , then they may be related in some sense to one another.
    • The more papers they are co-cited by, the stronger their relationship is.
    Fig. Paper i and paper j are co-cited by paper k
  • 41. Co-citation (共引证)
    • Let L be the citation matrix. Each cell of the matrix is defined as follows:
      • L ij = 1 if paper i cites paper j , and 0 otherwise.
    • Co-citation (denoted by C ij ) is a similarity measure defined as the number of papers that co-cite i and j ,
    • A square matrix C can be formed with C ij , and it is called the co-citation matrix .
  • 42. Bibliographic C oupling (文献联结)
    • Bibliographic coupling operates on a similar principle.
    • Bibliographic coupling links papers that cite the same articles
      • if papers i and j both cite paper k , they may be related.
    • The more papers they both cite, the stronger their similarity is.
    Fig. Both paper i and paper j cite paper k
  • 43. Bibliographic C oupling
    • Bij represents the number of papers that are cited by both paper i and j
    • A bibliographic coupling matrix B (can be formed with Bij) is symmetric and is regarded as a similarity measure of two papers in clustering
  • 44. HITS
    • HITS --- Hypertext Induced Topic Search .
    • HITS is search query dependent for finding Web communities
    • When the user issues a search query,
      • HITS first expands the list of relevant pages returned by a search engine and
      • then produces two rankings of the expanded set of pages, i.e.,
      • authority pages and hub pages .
  • 45. Authorities and Hubs
    • A uthority : Roughly, an authority is a page with many in-links.
      • The idea is that the page may have good or authoritative content on some topic and
      • thus many people trust it and link to it.
    • Hub : A hub is a page with many out-links.
      • The page serves as an organizer of the information on a particular topic and
      • points to many good authority pages on the topic.
  • 46. Mining the Web's Link Structures
    • Finding authoritative Web pages( 权威页面 )
      • Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
    • Hyperlinks( 超连接 ) can infer the notion of authority
      • A hyperlink pointing to another Web page, this can be considered as the author ’ s endorsement ( 认可 ) of the other page
    • Hub pages (Hub 页面 ): Web pages that provides collections of links to authorities
  • 47. Mining the Web's Link Structures
    • Mutually reinforcing relationship( 相互增强关联 ):
    • a good hub is a page that points to many good authorities;
    • a good authority is page that is pointed to by many good hubs
    … Authority page (red) … Hub page (yellow) Hubs Authorities
  • 48. Define Authority and Hub Weight for Each Page For the page p : authority weight ; hub weight q 1 q 2 q 3 page p a[p]:= sum of h[q], for  q, q  p q 1 q 2 q 3 page p h[p]:= sum of a[q], for  q, p  q Better authority (hub) pages with larger a(h)-values
  • 49. The HITS Algorithm d 1 d 2 d 4 d 3
    • HITS works on the pages in S(web space) , and assigns every page in S an authority score and a hub score .
    • Let the number of pages in S be n .
    • We again use G = ( V , E ) to denote the hyperlink graph of S .
    • We use L to denote the adjacency matrix of the graph.
  • 50. The HITS Algorithm
    • Let the authority score of the page i be a ( d i ), and the hub score of page i be h ( d i ).
    • The mutual reinforcing relationship of the two scores is represented as follows:
  • 51. HITS in Matrix Form
    • We use a to denote the column vector with all the authority scores,
    • a = ( a ( d 1 ), a ( d 2 ), … , a ( d n )) T , and
    • use h to denote the column vector with all the authority scores,
    • h = ( h ( d 1 ), h ( d 2 ), … , h ( d n )) T ,
    • Then,
    • a = L T h
    • h = La
  • 52. Computation of HITS
    • The computation of authority scores and hub scores : using power iteration (迭代) .
    • If we use a k and h k to denote authority and hub vectors at the k th iteration, the iterations for generating the final solutions are
  • 53. Relationships with C o-citation and B ibliographic C oupling
    • Recall that co-citation of pages i and j , denoted by C ij , is
      • the authority matrix ( L T L ) of HITS is the co-citation matrix C
    • bibliographic coupling of two pages i and j , denoted by B ij is
      • the hub matrix ( LL T ) of HITS is the bibliographic coupling matrix B
  • 54. HITS ( H yperlink- I nduced T opic S earch)
    • Explore interactions between hubs and authoritative pages
    • Use a term-index search engine to form the root set
      • Many of these pages are presumably relevant to the search topic (query)
      • Some of them should contain links to most of the prominent authorities
    • Expand the root set into a base set
      • all of the pages that the root-set pages link to, and
      • all of the pages that link to a page in the root set, up to a designated size cutoff
  • 55. Root Set ( 根集 ) and Base Set( 基集 )
    • Properties of base set (ideally)
      • Relatively small
      • Rich in relevant pages
      • Contain most (many) of the strongest authorities
    base root
  • 56. Step 1 of HITS: Create Base Set from Root Set
    • Subgraph(  ,  , t , d )
    •  : a query string
    •  : a text-based search engine
    • t , d : natural number // t =200; d =50
    • Let R  denote the top t results of  on  // R  root set
    • Set S  := R 
    • For each page p  R  // html_content  get_url(url)
    • Let W ( p ) denote the set of all pages p points to
    • Let V ( p ) denote the set of all pages pointing to p
    • Add all pages in W ( p ) to S 
    • If | V ( p ) |  d , then
    • add all pages in V ( p ) to S 
    • Else
    • add an arbitrary set of d pages from V ( p ) to S 
    • End
    • Return S  // S  base set : ca.1000 – 5000
  • 57. Step 1 of HITS: Create Base Set from Root Set
    • For instance,
    •   http://search.yahoo.com/bin/search?p=Data+Mining&ei=UTF-8
    •  http://search.yahoo.com/search?p=Data+Mining&ei=UTF-8&b=21
    •  http://search.yahoo.com/search?p=Data+Mining&ei=UTF-8&b=41
    • … …
    • Two types of links in S  :
    • transverse : between pages with different domain name;
    • intrinsic : between pages with same domain name;
    • (domain name: the first level of URL string of a page)
    • G  : deleting all intrinsic links from S 
  • 58. The HITS Algorithm d 1 d 2 d 4 “ Adjacency matrix” d 3 Initial values: a=h=1 Iterate Normalize:
  • 59. Step 2 of HITS: Calculate Authority and Hub Weight for Each Page
    • Iterate( G  )
    • G  : a collection of n linked pages
    • k= 1
    • Repeat
    • normalize a k , h k
    • k=k+1
    • Until a k and h k do not change significantly
    • Return (a k , h k ).
  • 60. Step 3 of HITS: Filter out the top authorities and hubs
    • Filter( G  , c )
    • G  : a collection of n linked pages
    • k , c : natural number
    • ( x k , y k ) := Iterate( G  ).
    • Report the pages with the c largest coordinates in x k as authorities .
    • Report the pages with the c largest coordinates in y k as hubs .
  • 61. Strengths and W eaknesses of HITS
    • Strength : its ability to rank pages according to the query topic, which may be able to provide more relevant authority and hub pages.
    • Weaknesses :
      • It is in fact quite easy to influence HITS since adding out-links in one ’ s own page is so easy.
      • Inefficiency at query time : The query time evaluation is slow. Collecting the root set, expanding it and performing eigenvector computation are all expensive operations
    • Reference( 文献 )
    • Jon M. Kleinberg: Authoritative Sources in a Hyper-linked Environment , Journal of ACM, Vol.46(5), 1999, pp604-632 http://www.cs.cornell.edu/home/kleinber/kleinber.html
  • 62. Chapter 8. Mining Complex Types of Data (II)
    • Introduction to Web mining
    • Web Structure Analysis
    • PageRank
    • HITS Approach
    • Summary
  • 63. Summary
    • Web mining includes mining Web link structures to identify authoritative Web pages, Web content and Web usage mining
    • We introduced
      • PageRank & Social network analysis, centrality and prestige
      • HITS & Co-citation and bibliographic coupling
  • 64. Summary
    • Important to note : Hyperlink based ranking is not the only algorithm used in search engines. In fact, it is combined with many content based factors to produce the final ranking presented to the user.
    • Links can also be used to find communities , which are groups of content-creators or people sharing some common interests.
      • Web communities
      • Email communities
      • Named entity communities, etc.