SEG 5120 – Web Mining



  1. SEG 5120 – Web Mining (Chris Yang)
  2. Web Mining
     • Data mining has traditionally been applied to databases (structured data).
     • Information on the Web is unstructured or semi-structured: Web pages, product catalogs, comments, news stories, company information.
     • Web mining aims to extract (mine) useful knowledge from the Web.
  3. Multidisciplinary Research
     • Web mining is a multidisciplinary research area involving data mining, machine learning, information retrieval, natural language processing, multimedia, and statistics.
  4. Challenges of Web Mining
     • The amount of information is huge, but it is easy to access.
     • Coverage is very wide and diverse: the Web is the first source almost everyone consults on all kinds of topics (e.g. weather, news, products, vocabulary).
     • It includes all types of information: structured tables, text, images, audio, etc.
     • It is semi-structured with HTML code; hyperlinks among pages within a Web site and across sites are available.
     • Information is redundant: the same piece of information, or variants of it, appears under different URLs.
  5. Challenges of Web Mining (continued)
     • The Web is noisy. A Web page contains a mixture of information: main content, advertisements, navigation panels, copyright notices, logos, etc. Some information is repeated across a set of Web pages.
     • The Web can be classified into the surface Web and the deep Web.
       • Surface Web: static Web pages stored on Web servers.
       • Deep Web: dynamic Web pages generated from databases by parameterized queries.
  6. Challenges of Web Mining (continued)
     • The Web not only disseminates information but also provides a platform for delivering services (Web services): users may submit a request with input parameters to a server, which performs the operation.
     • The Web is dynamic. Information on the Web changes constantly; monitoring the changes and identifying new patterns are challenging tasks.
     • The Web is a virtual society. It is more than data, information, and services: people, organizations, and automated applications around the world interact with one another.
  7. Growth of the Web
     • Earlier Internet services and data sources, such as Gopher, FTP, and Usenet, have been either ported to or made accessible from the Web.
     • The growth of government information on the Web has been tremendous.
     • Digital libraries are now accessible from the Web.
     • Companies are moving their business and services online.
     • Company databases that previously resided in legacy systems have been ported to or made accessible from the Web.
     • Applications and systems are migrating to the Web, although some Web data is hidden (the hidden Web).
  8. User Problems
     • Problems that information users encounter:
       • Web search results usually have low precision and recall: accurate indexing is impossible (low precision), and search engines are unable to index all Web pages (low recall).
       • Lack of personalization: the presentation of information does not match user preferences.
       • Useful knowledge is hard to extract: knowledge from the Web supports decision making, but it is difficult to extract.
     • Customization: organizations want to tailor their information or products to their intended customers; the information does not reach the target users.
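
The precision and recall mentioned on this slide can be made concrete with a small calculation. The sketch below assumes the standard IR definitions (precision is the fraction of retrieved pages that are relevant, recall is the fraction of relevant pages that were retrieved); the page identifiers are hypothetical and do not come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant pages that were actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical page identifiers for one query.
retrieved_pages = ["p1", "p2", "p3", "p4"]
relevant_pages = ["p2", "p4", "p5", "p6", "p7"]

precision, recall = precision_recall(retrieved_pages, relevant_pages)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four returned pages are relevant, and two of the five relevant pages were found.
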
  9. Subtasks in Web Mining
     • Web mining includes four subtasks (Etzioni, 1996; Kosala and Blockeel, 2000). The first is resource discovery:
       • Locating documents and services on the Web, e.g. with search engines such as Yahoo!, Google, AltaVista, …
       • Retrieving data, online or offline, from the text sources available on the Web, e.g. electronic newsletters, electronic newswire, newsgroups.
       • Identifying sources that were not originally accessible from the Web but are now.
  10. Subtasks in Web Mining (continued)
      • Information extraction: extracting specific information from Web resources.
        • Removing stop words, stemming, finding key phrases, transforming the representation.
        • Using wrappers to access a resource and parse its response.
      • Generalization: uncovering patterns at individual Web sites and across multiple sites.
      • Analysis: validation and/or interpretation of the mined patterns (Kosala and Blockeel, 2000).
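
As an illustration of the wrapper idea, here is a minimal sketch that parses an HTML response and extracts (URL, anchor text) records using Python's standard `html.parser` module. The class name `LinkWrapper` and the sample page are hypothetical; a real wrapper would be written against the markup of one specific site.

```python
from html.parser import HTMLParser


class LinkWrapper(HTMLParser):
    """Collect (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> element currently open, if any
        self._text = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical response from a product catalog page.
SAMPLE_PAGE = """
<html><body>
<h1>Catalog</h1>
<a href="/item/1">Red <b>mug</b></a>
<a href="/item/2">Blue plate</a>
</body></html>
"""

wrapper = LinkWrapper()
wrapper.feed(SAMPLE_PAGE)
wrapper.close()
print(wrapper.links)
```

The output is a list of structured records, which is the form the later generalization and analysis subtasks work on.
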
  11. Web Mining and Information Retrieval
      • Information retrieval (IR) is the automatic retrieval of relevant documents. It covers indexing text, searching documents, modeling, document classification, user interfaces, data visualization, and filtering.
      • Web mining can be considered part of the Web IR process, e.g. Web document classification.
  12. Web Mining and Information Extraction
      • Information extraction (IE) transforms a collection of documents (with the support of IR techniques) into information that is more readily digested and analyzed (Cowie and Lehnert, 1996).
      • IE is interested in the structure or representation of a document, a finer level of granularity than IR.
      • Web mining can be considered part of Web IE: extraction patterns or rules for Web documents are learned using machine learning or data mining techniques.
      • The result of Web IE takes the form of a structured database, or a compression or summary of the original Web documents.
  13. Categories of Web Mining
      • Web mining can be classified into three categories (Kosala and Blockeel, 2000):
      • Web usage mining: discovering user access patterns from Web usage logs.
        • Data sources: Web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other interaction data.
        • Applications: learning user profiles, identifying associated terms, Web traffic control.
      • Web structure mining: discovering useful knowledge from the structure of hyperlinks (in-links and out-links).
        • Applications: Web page ranking in Google.
      • Web content mining: discovering useful information and knowledge from the content of Web pages.
        • Applications: document categorization, sentiment classification.
      • Web mining techniques may combine techniques from the above categories.
  14. [Figure] (Kosala and Blockeel, 2000)
  15. Web Usage Mining
      • Predicts user behavior while the user interacts with the Web.
      • Two approaches:
        • Map the Web server's usage data into relational tables, then apply an adapted data mining technique.
        • Use the log data directly.
      • Typical challenges: distinguishing among unique users, server sessions, episodes, etc.
      • Applications:
        • Learning a user profile, or user modeling in adaptive interfaces (personalization).
        • Learning user navigation patterns to improve the system, business intelligence, and usage utilization.
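
The session-identification challenge can be sketched with a simple timeout heuristic applied directly to log records. This is only one possible approach, not a method given in the slides: it assumes each record is a (user, timestamp, url) tuple, that the client address identifies a user, and that a gap of more than 30 minutes starts a new session. The addresses, times, and URLs are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed cutoff, not taken from the slides


def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group (user, timestamp, url) records into per-user sessions."""
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    sessions = []
    for user, visits in by_user.items():
        visits.sort()  # chronological order within one user
        current = [visits[0][1]]
        last_ts = visits[0][0]
        for ts, url in visits[1:]:
            if ts - last_ts > timeout:
                # Long gap: close the current session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((user, current))
    return sessions


def at(hhmm):
    """Helper for building hypothetical timestamps on a fixed day."""
    return datetime.strptime("2000-01-01 " + hhmm, "%Y-%m-%d %H:%M")


log = [
    ("10.0.0.1", at("09:00"), "/index.html"),
    ("10.0.0.2", at("09:05"), "/index.html"),
    ("10.0.0.1", at("09:10"), "/products.html"),
    ("10.0.0.1", at("11:00"), "/index.html"),
]

sessions = sessionize(log)
for user, pages in sessions:
    print(user, pages)
```

Treating an address as a user is a simplification: proxies and shared machines break it, which is part of why distinguishing unique users is listed as a challenge.
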
  16. Web Structure Mining
      • Investigates the structure of hyperlinks within the Web (inter-document structure).
      • Inspired by research in social networks and citation analysis.
      • Discovers specific types of pages (hubs and authorities) based on in-links and out-links.
      • Prominent algorithms for modeling Web topology: HITS and PageRank, which calculate the quality rank and relevance of Web pages.
      • Other applications:
        • Web page categorization.
        • Discovering micro-communities on the Web.
        • Measuring the completeness of Web sites by measuring the frequency of local links that reside on the same server.
        • Measuring the replication of Web documents across Web warehouses, and discovering the nature of the hierarchy of hyperlinks.
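
To show how a link-based score can be computed from out-links alone, here is a minimal power-iteration sketch in the style of PageRank. It is a simplified illustration, not the production algorithm: the damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the four-page graph are all assumptions made for the example.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Approximate PageRank scores for a graph given as {page: [linked pages]}."""
    pages = sorted(set(out_links) | {q for targets in out_links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = out_links.get(p, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Page without out-links: spread its rank evenly over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page C collects in-links from three pages and ends up with the highest score, while D, which nothing links to, gets the lowest.
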
  17. Web Content Mining: IR View for Unstructured Documents
      • The bag-of-words (vector) representation takes the single words found in the training corpus as features.
      • The sequence in which the words occur is ignored.
      • Statistics about single words are used.
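
The bag-of-words representation described on this slide can be sketched in a few lines. The tokenizer (lowercased runs of letters and digits) and the two sample documents are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors), where each vector holds per-term counts."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))  # one feature per distinct word
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


# Hypothetical two-document corpus.
docs = ["Web mining mines the Web", "the Web is dynamic"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Because only counts are kept, "Web mining" and "mining Web" map to the same vector: word order is discarded, as the slide states.
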
  18. Web Content Mining: IR View (continued)
      • Feature selection:
        • Removing case, punctuation, infrequent words, and stop words.
        • Information gain, mutual information, cross entropy, odds ratio (Mladenic and Grobelnik, 1999).
      • Latent semantic indexing (LSI) transforms the original document vectors into a lower-dimensional space by analyzing the correlational structure of terms in the document collection.
      • Other features: word positions, n-gram representations, phrases, document concept categories (ontology), hypernyms (the linguistic term for the “is a” relationship), named entities.
      • Relational representation (first-order logic): relationships between different words and their positions, e.g. “Word X is to the left of Word Y in Sentence J.”
      • Other IR techniques: text classification, event detection and tracking, finding extraction patterns or rules.
      • NLP techniques.
      • IR view for semi-structured documents: making use of hypertext and hyperlinks.
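
Of the feature-selection measures listed, information gain is easy to illustrate. The sketch below assumes the usual definition (class entropy minus the entropy remaining after splitting the documents on whether a term is present); the tiny labeled corpus is hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * math.log2(p)
    return result


def information_gain(documents, labels, term):
    """Entropy reduction obtained by splitting documents on presence of `term`."""
    with_term = [lab for doc, lab in zip(documents, labels) if term in doc]
    without_term = [lab for doc, lab in zip(documents, labels) if term not in doc]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


# Hypothetical corpus: each document is a set of words with a class label.
docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"market", "team"}]
labels = ["sports", "sports", "finance", "finance"]

ig_goal = information_gain(docs, labels, "goal")
ig_team = information_gain(docs, labels, "team")
print(f"goal: {ig_goal:.2f}  team: {ig_team:.2f}")
```

In this corpus "goal" separates the two classes perfectly and scores 1 bit, while "team" appears equally in both classes and scores 0, so a selector ranking by information gain would keep "goal" and drop "team".
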
  19. Database View on Web Content Mining
      • Three classes of tasks (Florescu et al., 1998): modeling and querying the Web; information extraction and integration; Web site construction and restructuring.
      • The database view tries to infer the structure of a Web site, or to transform a Web site into a database, for better information management and querying on the Web.
      • How?
        • Finding the schema of Web documents.
        • Building a Web warehouse, a Web knowledge base, or a virtual database.
      • Semi-structured data refers to data that has some structure, but not a rigid one.
  20. Database View on Web Content Mining (continued)
      • The object exchange model (OEM) represents semi-structured data as a labeled graph (Abiteboul et al., 1997).
        • Objects are the vertices; labels are on the edges.
        • Each object is identified by an object identifier (oid).
        • Each object has a value that is either atomic (integer, string, gif, html, etc.) or complex (a set of object references, denoted as a set of (label, oid) pairs).
      • Techniques: site-specific wrappers or parsers for hypertext documents.
      • Schema extraction, or DataGuides (Goldman and Widom, 1997): a structural summary of semi-structured data (always approximated).
      • Some other work deals with finding frequent substructures (sub-schemas).
      • Multi-layered database (MLDB): each layer is obtained by generalization of the lower layers, and a special-purpose query language for Web mining is used to extract knowledge.
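
The OEM description can be illustrated with a small data structure. This is a sketch under stated assumptions, not the representation used in the cited work: the class `OEMObject`, the helper functions, the `&n` oid strings, and the sample graph are all hypothetical, and a list of (label, oid) pairs stands in for the set described on the slide.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class OEMObject:
    oid: str
    # Atomic value (int or str), or a list of (label, oid) pairs for a complex object.
    value: Union[int, str, list] = field(default_factory=list)

    def is_complex(self):
        return isinstance(self.value, list)


def children(graph, oid, label):
    """Objects reachable from `oid` along edges carrying `label`."""
    obj = graph[oid]
    if not obj.is_complex():
        return []
    return [graph[child] for lab, child in obj.value if lab == label]


def follow_path(graph, root_oid, labels):
    """Values of all objects reached from the root by a sequence of edge labels."""
    frontier = [graph[root_oid]]
    for label in labels:
        frontier = [child for obj in frontier for child in children(graph, obj.oid, label)]
    return [obj.value for obj in frontier]


# Hypothetical graph: a root object pointing to one course with two atomic fields.
graph = {o.oid: o for o in [
    OEMObject("&1", [("course", "&2")]),
    OEMObject("&2", [("title", "&3"), ("instructor", "&4")]),
    OEMObject("&3", "Web Mining"),
    OEMObject("&4", "Chris Yang"),
]}

titles = follow_path(graph, "&1", ["course", "title"])
print(titles)
```

Because the labels live on the edges rather than in a fixed schema, a second course object with different fields could be added without changing any existing object.
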