  • Web Content Mining Semester 2, 2007
  • First, the background of collaborative filtering. The amount of information in the world is increasing far more quickly than our ability to process it, which motivates information filtering. The diagram shows the whole process. First, users’ preferences are collected. The information filtering system then uses these preferences to divide the items (articles, movies, books, pictures, and so on) into two categories. The first category contains valuable items to recommend to users. The second category does not meet users’ requirements and is discarded. Why do information filtering? It can save users significant time and help them find the most valuable information.
  • Information filtering rests on one assumption: data about users’ preferences can be collected. That is, users’ preferences must be expressed as data. This can be done explicitly: just as newspapers assign stars to films, users can assign ratings to items. It can also be done implicitly, for example by measuring the time a user spends on a message or by checking whether the user wrote a reply. Here we assume that users have assigned ratings to items and that the system can collect these ratings to create user profiles.
  • Our research problem can clearly be addressed with sliding windows: use only the new data in the current window and discard old data. However, this approach reduces the amount of data and aggravates sparsity. Sparsity refers to the fact that most users do not rate many items, so the user-item matrix is very sparse. Using only new data makes the sparsity more severe and thereby degrades the precision of the recommender system. We therefore propose a new algorithm, the time weight collaborative filtering algorithm.
  • Now for the applications of CF. CF has achieved success in both research and practice. It is best known for its use on e-commerce web sites. For example, Amazon.com, a very famous online bookstore, uses an item-based CF approach to recommend books to users, which helps sales. The slide shows the interface of the recommendation system at Amazon.com. CF has also been widely used in other areas, for example digital libraries and TV show recommendation.
  • These approaches can be divided into two categories. The first is model-based collaborative filtering, introduced briefly here. Model-based collaborative filtering uses the data to build a model and then uses the model for predictions. These models may use Bayesian networks, clustering, or association rules, and most of them are probabilistic. For example, users are grouped into clusters under the assumption that users in the same cluster share the same preferences. The system computes the probability that a user belongs to each cluster, selects the most likely cluster, and recommends that cluster’s preferences to the user.
  • In contrast to model-based algorithms, memory-based approaches are usually simpler and perform better when the training data is small. Memory-based algorithms use statistical techniques to find a set of nearest neighbours and use them to predict the user’s preferences. If the algorithm computes the similarity between users and uses a set of users as nearest neighbours, it is called user-based collaborative filtering. If it computes the similarity between items and uses a set of items as nearest neighbours, it is called item-based collaborative filtering. Compared to user-based algorithms, item-based algorithms can dramatically improve the scalability of collaborative filtering.
  • Although CF has been widely used in the real world, several challenges still need to be addressed. Prediction precision is the key issue: everyone wants precise recommendations, and precision influences how widely CF is adopted, so a lot of research focuses on it. It can be evaluated with MAE, MSE, and NMAE. Mean Absolute Error (MAE) measures the average absolute deviation between a predicted rating and the user’s true rating. MSE squares the error. NMAE normalizes the MAE. Scalability: if the number of users and items increases dramatically, how does the algorithm perform? Robustness: given some degree of noise in the data, how well can the CF algorithm still provide accurate predictions? Sparsity: a well-known problem in collaborative filtering. Most users do not rate many items, so the user-item rating matrix is very sparse. Cold start: the problem of making recommendations for new users or new items. Our research motivation is to improve the accuracy of prediction.
  • Web Structure Mining
  • Transcript of "Data Mining Techniques"

    1. 1. Data Mining - Web Mining Dr Heng Tao SHEN ©The University of Queensland, Brisbane Australia http://www.itee.uq.edu.au/~shenht
    2. 2. Outline <ul><li>Web Content Mining </li></ul><ul><li>Web Usage Mining </li></ul><ul><li>Web Structure Mining </li></ul>
    3. 3. What is Web Mining? <ul><li>Web data mining - techniques to automatically discover and extract information from Web documents/services </li></ul><ul><li>Web mining research </li></ul><ul><ul><li>Database (DB) </li></ul></ul><ul><ul><li>Information retrieval (IR) </li></ul></ul><ul><ul><li>Machine learning (ML) </li></ul></ul><ul><ul><li>Natural language processing (NLP) </li></ul></ul>
    4. 4. Mining the Web <ul><li>Web is an information source for: </li></ul><ul><ul><li>Information services </li></ul></ul><ul><ul><ul><li>news, advertisements, consumer information, financial management, education, government, e-commerce, etc. </li></ul></ul></ul><ul><ul><li>Hyper-link information </li></ul></ul><ul><ul><li>User Behaviors (Access and usage information) </li></ul></ul><ul><ul><li>Web Site contents and Organization </li></ul></ul><ul><ul><li>Social media </li></ul></ul>
    5. 5. Web Mining: challenges <ul><li>Searches for </li></ul><ul><ul><li>Regularity and dynamics of Web contents </li></ul></ul><ul><ul><li>Web user access patterns </li></ul></ul><ul><ul><li>Web structures </li></ul></ul><ul><li>Problems </li></ul><ul><ul><li>The “abundance” problem ( rich data but poor information ) </li></ul></ul><ul><ul><li>Limited coverage of the Web: hidden Web sources, majority of data in DBMS </li></ul></ul><ul><ul><li>Dynamic and semi-structured </li></ul></ul><ul><ul><li>Limited on keyword-oriented search </li></ul></ul><ul><ul><li>Limited customization to individual users </li></ul></ul>
    6. 6. A Taxonomy of Web Mining Web Mining Web Structure Mining Web Content Mining Web Page Content Mining Search Result Mining Web Usage Mining General Access Pattern Tracking Customized Usage Tracking Authority & Hub Pages Ranking Web Community Discovering
    7. 7. Web Content Mining <ul><li>Discovery of useful information from web contents / data / documents </li></ul><ul><ul><li>Web data contents: text, image, audio, video, metadata and hyperlinks. </li></ul></ul><ul><li>Information Retrieval View (Structured + Semi-Structured) </li></ul><ul><ul><li>Assist / improve information finding </li></ul></ul><ul><ul><li>Filter information for users based on user profiles </li></ul></ul><ul><ul><li>Information extraction </li></ul></ul>
    8. 8. Issues in Web Content Mining <ul><li>Developing intelligent tools for IR </li></ul><ul><ul><li>Finding keywords and key phrases </li></ul></ul><ul><ul><li>Discovering grammatical rules and collocations </li></ul></ul><ul><ul><li>Hypertext classification/categorization </li></ul></ul><ul><ul><li>Extracting key phrases from text documents </li></ul></ul><ul><ul><li>Learning extraction models/rules </li></ul></ul><ul><ul><li>Hierarchical clustering </li></ul></ul><ul><ul><li>Predicting (words) relationship </li></ul></ul>
    9. 9. Web Content Mining - implementation <ul><li>Information Filtering/Categorization </li></ul><ul><ul><li>Collaborative Filtering </li></ul></ul><ul><li>Personalized Web Agents </li></ul><ul><ul><li>Web Wrappers </li></ul></ul>
    10. 10. Information Filtering/Categorization <ul><li>Using various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. </li></ul><ul><ul><li>HyPursuit: uses semantic information embedded in link structures and document content to create cluster hierarchies of hypertext documents, and structure an information space </li></ul></ul><ul><ul><li>BO (Bookmark Organizer): combines hierarchical clustering techniques and user interaction to organize a collection of Web documents based on conceptual information </li></ul></ul>
    11. 11. How do We Find Similar Web Pages? <ul><li>Content based approach </li></ul><ul><li>Structure based approach </li></ul><ul><li>Combining both content and structure approaches </li></ul>
    12. 12. HyPursuit: Similarity Functions for Web Pages with Hyperlinks. The hyperlink similarity between two hypertext documents is defined in terms of common descendants, common ancestors, and the shortest path length between the documents. [Equations omitted] $W_d$, $W_a$, and $W_s$ are damping factors for normalization. $spl_{xy}$ is the length of a shortest path between $d_x$ and $d_y$. $spl_{xy}^{z}$ is the length of a shortest path between $d_x$ and $d_y$ that does not pass through $d_z$.
    13. 13. How can we recommend contents to users? <ul><li>Recommender systems: </li></ul><ul><ul><li>Content based recommender algorithms </li></ul></ul><ul><ul><li>Collaborative filtering algorithms </li></ul></ul>
    14. 14. Background: Information Filtering <ul><li>Why IF? </li></ul><ul><ul><li>Save users’ time </li></ul></ul><ul><ul><li>Find the “interesting” items </li></ul></ul>[Diagram: user preferences and items feed an information filtering system, which outputs recommended items and discards unwanted information]
    15. 15. Assumption <ul><li>Data about users’ preferences can be collected. </li></ul>User profiles user preferences on items
    16. 16. Definition of CF problem <ul><li>Given a dataset D of tuples <Ui,Ij,Oij> </li></ul><ul><li>where Ui identifies the i-th user of the system, </li></ul><ul><li>Ij identifies the j-th item of the system, </li></ul><ul><li>Oij represents the i-th user’s opinion on the j-th item </li></ul><ul><li>Find a list of k recommended items for each user. </li></ul>
    17. 17. A Mapping of Two High-Dimensional Spaces Q1: For a given kind of items, what kind of customers would like it? Q2: For certain type of customers, what kind of items do they like? User Preferences Item Space User Space
    18. 18. Applications of Collaborative filtering <ul><ul><li>Digital library [ Seikyung Jung,CIKM04 ] </li></ul></ul><ul><ul><li>Recommend TV show[ Kamal Ali,KDD04 ] </li></ul></ul>E-commerce
    19. 19. Types of Recommendation Methods: Model-based or Memory-based collaborative filtering <ul><ul><li>Model-based </li></ul></ul>[Diagram: user preferences are used to build a model of users, which produces the recommendation]
    20. 20. Memory-based collaborative filtering <ul><li>User-based recommendation </li></ul><ul><ul><li>Find similar users and use them as the nearest neighbors for recommendation. </li></ul></ul><ul><li>Item-based recommendation </li></ul><ul><ul><li>Find similar items and use them as the nearest neighbors for recommendation. </li></ul></ul>
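The user-based variant can be illustrated with a minimal sketch. The toy ratings, the choice of cosine similarity over co-rated items, and the names `cosine` and `predict` are all assumptions made for illustration; the slides do not prescribe a particular similarity measure or prediction rule.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating on a 1-5 scale}
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 4, "b": 2, "c": 5, "d": 4},
    "carol": {"a": 1, "b": 5, "d": 2},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item, k=2):
    """Predict a rating as the similarity-weighted average of the
    k most similar users who rated the item (None if nobody did)."""
    neighbours = [
        (cosine(ratings[user], r), r[item])
        for other, r in ratings.items()
        if other != user and item in r
    ]
    neighbours.sort(reverse=True)          # most similar first
    top = neighbours[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None
    return sum(sim * rating for sim, rating in top) / total

print(predict("alice", "d"))
```

An item-based variant would apply the same idea with the roles swapped: similarities are computed between item rating vectors instead of user rating vectors.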
    21. 21. Challenges to Collaborative filtering <ul><li>Prediction precision </li></ul><ul><li>Scalability: as the number of users and items increases dramatically, how does the algorithm perform? </li></ul><ul><li>Robustness: given some degree of noise in the data, how well can the algorithm provide accurate predictions? </li></ul><ul><li>Sparsity: the user-item rating matrix is very sparse </li></ul><ul><li>Cold start: how to make recommendations for new users or new items </li></ul>
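Prediction precision is typically quantified with error metrics; the slide notes name MAE, MSE, and NMAE. The sketch below shows one common way to compute them. Normalizing MAE by the width of the rating scale is an assumption here (it is a widespread convention, but the slides do not define the normalization), and the sample values are hypothetical.

```python
def mae(predicted, actual):
    """Mean Absolute Error: average absolute deviation from the true rating."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def mse(predicted, actual):
    """Mean Squared Error: like MAE, but each error is squared."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def nmae(predicted, actual, r_min, r_max):
    """MAE normalized by the rating range (assumed convention)."""
    return mae(predicted, actual) / (r_max - r_min)

# Hypothetical predictions against true ratings on a 1-5 scale
predicted = [4.0, 3.0, 5.0]
actual = [5.0, 3.0, 3.0]

print(mae(predicted, actual))          # 1.0
print(nmae(predicted, actual, 1, 5))   # 0.25
```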
    22. 22. Web Usage Mining <ul><li>Web Log Mining </li></ul><ul><ul><li>Pre-processing </li></ul></ul><ul><ul><li>Pattern mining </li></ul></ul><ul><ul><li>Pattern analysis </li></ul></ul>
    23. 23. Web Usage Mining - Applications <ul><li>Target potential customers for e-commerce </li></ul><ul><li>Enhance the quality and delivery of Internet information services </li></ul><ul><li>Improve web server performance (Load Balancing) </li></ul><ul><li>Identify potential prime advertisement locations </li></ul><ul><li>Facilitate personalization/adaptive sites </li></ul><ul><li>Improve site design </li></ul><ul><li>Fraud/intrusion detection </li></ul><ul><li>Predict user’s actions (allows pre-fetching) </li></ul>
    24. 24. Web Usage Mining - Outcome <ul><li>Association rules </li></ul><ul><li>– Find pages that are often viewed together </li></ul><ul><li>Clustering </li></ul><ul><li>– Cluster users based on browsing patterns </li></ul><ul><li>– Cluster pages based on content </li></ul><ul><li>Classification </li></ul><ul><li>– Relate user attributes to patterns </li></ul>
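Finding pages that are often viewed together can be sketched as pairwise association rules over sessions. This is a minimal illustration restricted to page pairs; the session data, the `min_support` threshold, and the function name `pair_rules` are hypothetical, and a real system would typically use a full frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: each is the set of pages viewed in one visit
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/products", "/cart"},
]

def pair_rules(sessions, min_support=0.5):
    """Return {(x, y): (support, confidence)} for rules x -> y
    whose page pair appears in at least min_support of sessions."""
    n = len(sessions)
    page_count = Counter(p for s in sessions for p in s)
    pair_count = Counter(
        frozenset(pair) for s in sessions for pair in combinations(sorted(s), 2)
    )
    rules = {}
    for pair, count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        a, b = sorted(pair)
        # confidence(x -> y) = sessions with both / sessions with x
        rules[(a, b)] = (support, count / page_count[a])
        rules[(b, a)] = (support, count / page_count[b])
    return rules

for rule, (support, confidence) in sorted(pair_rules(sessions).items()):
    print(rule, support, round(confidence, 2))
```

In this toy data, every session that contains `/cart` also contains `/products`, so the rule `/cart -> /products` has confidence 1.0.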
    25. 25. Web Usage Mining - Phases <ul><li>Three distinctive phases: </li></ul><ul><ul><li>preprocessing, </li></ul></ul><ul><ul><li>pattern discovery </li></ul></ul><ul><ul><li>pattern analysis </li></ul></ul>
    26. 26. Phase 1: Pre-processing <ul><li>Converts the raw data into the data abstraction needed to apply the data mining algorithm </li></ul><ul><ul><li>Mapping the log data into relational tables before an adapted data mining technique is performed. </li></ul></ul><ul><ul><li>Using the log data directly by utilizing special pre-processing techniques. </li></ul></ul>
    27. 27. Raw data – Web log <ul><li>Click stream : a sequential series of page view requests </li></ul><ul><li>User session : a delimited set of user clicks (click stream) across one or more Web servers. </li></ul><ul><li>Server session (visit) : a collection of user clicks to a single Web server during a user session. </li></ul><ul><li>Episode : a subset of related user clicks that occur within a user session. </li></ul>
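A pre-processing step that turns a click stream into sessions can be sketched as follows. The slides do not say how sessions are delimited; splitting on an inactivity timeout (30 minutes here) is a common heuristic and is an assumption of this sketch, as are the record layout `(user, timestamp, url)` and the function name `sessionize`.

```python
def sessionize(clicks, timeout=1800):
    """Group clicks into sessions per user.

    clicks: iterable of (user, timestamp_in_seconds, url).
    A new session starts when a user has been idle for more than
    `timeout` seconds (assumed heuristic, not from the slides).
    Returns a list of (user, [urls in click order]).
    """
    sessions = []
    last_seen = {}  # user -> (timestamp of last click, index of open session)
    for user, ts, url in sorted(clicks, key=lambda c: (c[0], c[1])):
        previous = last_seen.get(user)
        if previous is None or ts - previous[0] > timeout:
            sessions.append((user, [url]))       # open a new session
            index = len(sessions) - 1
        else:
            index = previous[1]
            sessions[index][1].append(url)       # continue the open session
        last_seen[user] = (ts, index)
    return sessions

# Hypothetical log records
clicks = [
    ("u1", 0, "/a"),
    ("u2", 10, "/a"),
    ("u1", 60, "/b"),
    ("u1", 4000, "/c"),   # more than 30 minutes after the previous click
]
print(sessionize(clicks))
```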
    28. 28. Phase 2: Pattern Discovery <ul><li>Pattern discovery uses techniques such as statistical analysis, association rules, clustering, classification, sequential pattern mining, and dependency modeling. </li></ul>
    29. 29. Phase 3: Pattern Analysis <ul><li>A process to gain knowledge about how visitors use a website in order to </li></ul><ul><ul><li>Prevent disorientation and help designers place important information/functions exactly where visitors look for them and in the way users need them. </li></ul></ul><ul><ul><li>Build up an adaptive website server </li></ul></ul>
    30. 30. Web Structure Mining <ul><li>To discover the link structure of the hyperlinks at the inter-document level to generate a structural summary of Websites and Web pages. </li></ul><ul><ul><li>Direction 1: based on the hyperlinks, categorizing the Web pages and generating information. </li></ul></ul><ul><ul><li>Direction 2: discovering the structure of the Web document itself. </li></ul></ul><ul><ul><li>Direction 3: discovering the nature of the hierarchy or network of hyperlinks in the Website of a particular domain. </li></ul></ul>
    31. 31. Web Structure Mining - Applications <ul><li>Web pages categorization/ranking </li></ul><ul><li>Communities discovery </li></ul><ul><li>Schema Discovery in Semi-structured Environment </li></ul>
    32. 32. Well-known Methods <ul><li>HITS (Topic distillation) </li></ul><ul><li>PageRank (Ranking web pages used by Google) </li></ul><ul><li>Algorithms in Cyber-community </li></ul>
    33. 33. HITS <ul><li>Hyperlink Induced Topic Search. </li></ul><ul><li>A simple approach by finding hubs and authorities. </li></ul><ul><li>View web as a directed graph . </li></ul><ul><li>Assumption: if document A has a hyperlink to document B, then the author of document A thinks that document B contains valuable information. </li></ul>
    34. 34. HITS: Main Idea <ul><li>Concerned with the identification of the most authoritative , or definitive , Web pages on a broad topic. </li></ul><ul><li>Focused on only one topic. </li></ul><ul><li>Viewing the Web as a graph. </li></ul><ul><li>A purely link structure-based computation, ignoring the textual content. </li></ul>
    35. 35. HITS: Hubs and Authority <ul><li>Hub : a web page that links to a collection of prominent sites on a common topic. </li></ul><ul><li>Authority : a page that provides authoritative content on a broad topic; a web page pointed to by many hubs. </li></ul><ul><li>Mutual Reinforcing Relationship : a good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities. </li></ul>
    36. 36. HITS: Two Main Steps <ul><li>A sampling component, which constructs a focused collection of several thousand web pages likely to be rich in relevant authorities. </li></ul><ul><li>A weight-propagation component, which determines numerical estimates of hub and authority weights by an iterative procedure. </li></ul><ul><li>As a result, the pages with the highest weights are returned as hubs and authorities for the search topic. </li></ul>
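The weight-propagation step can be sketched in a few lines. This is a minimal illustration of the mutual reinforcement rule on a small in-memory graph; the sampling step is skipped, and the graph, the fixed iteration count, and the L2 normalization are assumptions of the sketch.

```python
from math import sqrt

def hits(links, iterations=50):
    """links: dict mapping a page to the set of pages it points to.
    Returns (hub, authority) weight dictionaries."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages pointing to it
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub weight: sum of authority weights of the pages it points to
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the weights stay bounded across iterations
        norm_a = sqrt(sum(v * v for v in auth.values())) or 1.0
        norm_h = sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / norm_a for p, v in auth.items()}
        hub = {p: v / norm_h for p, v in hub.items()}
    return hub, auth

# Hypothetical focused subgraph: three hub-like pages, two candidate authorities
links = {"h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "h3": {"a1"}}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # a1
```

Here `a1` ends up with the highest authority weight because all three hubs point to it, and `h1` and `h2` score higher as hubs than `h3` because they point to both authorities.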
    37. 37. HITS: Drawbacks <ul><li>Limited on narrow topics </li></ul><ul><ul><li>Not enough authoritative pages </li></ul></ul><ul><ul><li>Frequently returns resources for a more general topic </li></ul></ul><ul><ul><li>Adding a few edges can potentially change scores considerably </li></ul></ul><ul><li>Topic drifting </li></ul><ul><ul><li>Appears when hubs discuss multiple topics </li></ul></ul>
    38. 38. PageRank <ul><li>Introduced by Brin and Page (1998). </li></ul><ul><li>Mines the hyperlink structure of the web to produce a ‘global’ importance ranking of every web page. </li></ul><ul><li>Used in the Google search engine. </li></ul><ul><li>Web search results are returned in rank order. </li></ul><ul><li>Treats a link like an academic citation. </li></ul><ul><li>Assumption: highly linked pages are more ‘important’ than pages with few links. </li></ul>
    39. 39. PageRank: Main Idea <ul><li>A page has a high rank if the sum of the ranks of its back-links is high. </li></ul><ul><li>Google utilizes a number of factors to rank the search results: </li></ul><ul><ul><li>proximity, anchor text, page rank </li></ul></ul><ul><li>The benefits of PageRank are greatest for underspecified queries. For example, a ‘Stanford University’ query using PageRank lists the university home page first. </li></ul>
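The idea that a page ranks highly when the ranks of its back-links sum to a high value can be sketched as an iterative computation. The damping factor (0.85), the even redistribution of rank from pages without out-links, and the toy graph are assumptions of this sketch; the slides state only the main idea.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links: dict mapping a page to the set of pages it points to.
    Returns a dict of ranks that sum to 1."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, ())
            if targets:
                # A page passes its rank on, split evenly over its out-links
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical graph: a and b both link to c, and c links back to a
links = {"a": {"c"}, "b": {"c"}, "c": {"a"}}
rank = pagerank(links)
print(sorted(rank, key=rank.get, reverse=True))  # ['c', 'a', 'b']
```

Page `c` ranks first because it has two back-links, and `a` outranks `b` because its single back-link comes from the highly ranked `c`.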
    40. 40. What is Web Community? <ul><li>A cyber community on the web is a group of web pages sharing a common interest. </li></ul><ul><ul><li>Eg. A group of web pages interested in data-mining. </li></ul></ul><ul><li>Main properties: </li></ul><ul><ul><li>Pages in the same community should be similar to each other in contents. </li></ul></ul><ul><ul><li>The pages in one community should differ from the pages in another community. </li></ul></ul><ul><ul><li>Similar to cluster. </li></ul></ul>
    41. 41. Community Discovery <ul><li>Discovering web communities is similar to clustering. So, we must define the similarity of two pages. </li></ul>
    42. 42. Similarity of Web Pages <ul><ul><li>Co-citation : the similarity of A and B is measured by the number of pages that cite both A and B. </li></ul></ul><ul><ul><li>Bibliographic coupling : the similarity of A and B is measured by the number of pages cited by both A and B. </li></ul></ul>[Figure: co-citation and bibliographic coupling between Page A and Page B]
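Both measures reduce to simple counts on the link graph, as the following sketch shows. The graph and the function names are hypothetical; pages are represented as a dictionary from each page to the set of pages it cites.

```python
def co_citation(links, a, b):
    """Number of pages that cite (link to) both a and b."""
    return sum(1 for targets in links.values() if a in targets and b in targets)

def bibliographic_coupling(links, a, b):
    """Number of pages cited by (linked from) both a and b."""
    return len(links.get(a, set()) & links.get(b, set()))

# Hypothetical link graph: page -> set of pages it links to
links = {
    "p1": {"A", "B"},
    "p2": {"A", "B"},
    "p3": {"A"},
    "A": {"x", "y"},
    "B": {"y", "z"},
}

print(co_citation(links, "A", "B"))             # 2 (p1 and p2 cite both)
print(bibliographic_coupling(links, "A", "B"))  # 1 (both cite y)
```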
    43. 43. The CT-algorithm <ul><li>The method comes from the Clever search engine project at IBM Almaden Research Center. </li></ul><ul><li>They call their method Communities Trawling (CT). </li></ul><ul><li>They ran it on a graph of 200 million pages and reported that it worked very well. </li></ul>
    44. 44. Basic idea of CT <ul><li>Definition of communities: dense directed bipartite subgraphs. </li></ul><ul><ul><li>Bipartite graph: nodes are partitioned into two sets, F and C. </li></ul></ul><ul><ul><li>Every directed edge in the graph is directed from a node u in F to a node v in C. </li></ul></ul><ul><ul><li>The graph is dense if many of the possible edges between F and C are present. </li></ul></ul>[Figure: fans (F) linking to centers (C)]
    45. 45. Basic idea of CT <ul><li>Bipartite cores </li></ul><ul><ul><li>a complete bipartite subgraph with at least i nodes from F and at least j nodes from C. </li></ul></ul><ul><ul><li>i and j are tunable parameters. </li></ul></ul><ul><ul><li>An (i, j) bipartite core. </li></ul></ul><ul><li>Every community has such a core for some i and j. </li></ul>INFS4203 / INFS7203 Data Mining [Figure: an (i=3, j=3) bipartite core]
    46. 46. Basic idea of CT <ul><li>A bipartite core is the identity of a community. </li></ul><ul><li>To extract all the communities is to enumerate all the bipartite cores on the web. </li></ul><ul><li>The authors devised an efficient algorithm to enumerate the bipartite cores. Its main idea is iterative pruning (elimination-generation pruning). </li></ul>
    47. 47. Weakness of CT <ul><li>The bipartite graph cannot suit all kinds of communities. </li></ul><ul><li>The density of the community is hard to adjust. </li></ul>
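To make the notion of enumerating bipartite cores concrete, here is a brute-force sketch for small graphs. It is not the elimination-generation pruning used in trawling, which the slides do not detail; it simply tries every set of exactly i fans and reports each set of exactly j centers they all link to. The graph and the function name `bipartite_cores` are hypothetical.

```python
from itertools import combinations

def bipartite_cores(links, i, j):
    """Enumerate (i, j) cores by brute force.

    links: dict mapping a fan page to the set of pages it points to.
    Returns a list of (fans, centers) tuples in which every fan
    links to every center. Feasible only for small graphs.
    """
    cores = []
    # A fan must have at least j out-links to take part in an (i, j) core
    fans = sorted(p for p, targets in links.items() if len(targets) >= j)
    for fan_set in combinations(fans, i):
        # Centers must be linked to by every fan in the candidate set
        common = set.intersection(*(set(links[f]) for f in fan_set))
        for center_set in combinations(sorted(common), j):
            cores.append((fan_set, center_set))
    return cores

# Hypothetical graph: f1, f2 and f3 all point to c1 and c2
links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2"},
    "f3": {"c1", "c2", "c4"},
    "f4": {"c4"},
}
print(bipartite_cores(links, 3, 2))
# [(('f1', 'f2', 'f3'), ('c1', 'c2'))]
```

The out-degree filter on fans is a small example of the kind of pruning that makes enumeration tractable: a page with fewer than j out-links can never be a fan in an (i, j) core.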
    48. 48. Summary <ul><li>Web mining </li></ul><ul><ul><li>Content mining </li></ul></ul><ul><ul><li>Usage mining </li></ul></ul><ul><ul><li>Structure mining </li></ul></ul><ul><li>Next week: </li></ul><ul><ul><li>Time Series Mining </li></ul></ul>
    49. 49. References - Web Content Mining <ul><li>(HyPursuit) Weiss, R., Velez, B., Sheldon, M., Namprempre, C., Szilagyi, P., Duda, A. and Gifford, D. ` HyPursuit : A hierarchical network search engine that exploits content-link hypertext clustering' in Proc. of the ACM Hypertext'96 (Washington, DC. March, 1996).  http://www.psrg.lcs.mit.edu/ftpdir/ papers/  </li></ul><ul><li>(CF Approach) Sarwar B., Karypis G., Konstan J., and Riedl J., &quot;Item-Based Collaborative Filtering Recommendation Algorithms&quot;, Proceedings of ACM 10th WWW Conference, Hong Kong, May 2001, 285-295.  </li></ul>
    50. 50. References - Web Mining Overview <ul><li>Kosala, R. and Blockeel, H. Web Mining Research: A Survey . SIGKDD Explorations , 2(1):1-15, 2000 </li></ul><ul><li>J. Srivastava, R. Cooley, M. Deshpande, Pang-Ning Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data , ACM SIGKDD Explorations, Vol. 1, Issue 2, 2000. </li></ul><ul><li>(Community Trawling) S.R. Kumar et al., &quot; Trawling Emerging-Cyber-Communities Automatically ,&quot; Proc.8th World Wide Web Conf., Elsevier Science, Amsterdam, 1999, pp. 403-415. </li></ul><ul><li>(HITS Algorithm) Kleinberg J.M. &quot; Authoritative Sources in a Hyperlinked Environment &quot;, Journal of the ACM, 46(5), September 1999, pp. 604-632. </li></ul><ul><li>(Page Rank)  S. Brin, L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine . Proceedings of the Seventh World Wide Web Conference, Brisbane, Australia, April 1998. </li></ul>