Data Mining and the Web: Past, Present and Future


Published in: Technology, Education
  • Data Mining and the Web: Past, Present and Future

    1. 1. Data Mining and the Web: Past, Present and Future. Kyuseok Shim, S. Seshadri, Rajeev Rastogi and Minos N. Garofalakis (Bell Laboratories)
    2. 2. Agenda <ul><li>Today's WWW search tools are plagued by four problems: </li></ul><ul><ul><li>the abundance problem </li></ul></ul><ul><ul><li>limited coverage of the Web </li></ul></ul><ul><ul><li>a limited query interface </li></ul></ul><ul><ul><li>limited customization to individual users </li></ul></ul><ul><li>Data Mining Techniques </li></ul><ul><ul><li>Association rules, classification, clustering, etc. </li></ul></ul><ul><li>Web Mining Techniques </li></ul><ul><ul><li>Hubs and Authorities </li></ul></ul><ul><ul><li>Building Web Knowledge Base </li></ul></ul><ul><ul><li>Mining the Structure of Web Documents </li></ul></ul><ul><li>Web Mining Research Issues </li></ul><ul><ul><li>Mining Web Structure </li></ul></ul><ul><ul><li>Improving Customization </li></ul></ul><ul><ul><li>Extracting Information from Hypertext Documents </li></ul></ul>
    3. 3. Introduction <ul><li>Problems: </li></ul><ul><ul><li>The Web is a huge, diverse and dynamic collection of interlinked hypertext documents. </li></ul></ul><ul><ul><li>Except for hyperlinks, the Web is largely unstructured. </li></ul></ul><ul><ul><li>The contents of many Internet sources are hidden behind search interfaces. </li></ul></ul><ul><ul><li>99% of the information on the Web is of no interest to 99% of the people. </li></ul></ul><ul><ul><ul><li>the abundance problem </li></ul></ul></ul><ul><ul><ul><ul><li>hundreds of irrelevant documents are returned in response to a search query </li></ul></ul></ul></ul><ul><ul><ul><li>limited coverage of the Web </li></ul></ul></ul><ul><ul><ul><li>a limited query interface </li></ul></ul></ul><ul><ul><ul><ul><li>based on syntactic, keyword-oriented search </li></ul></ul></ul></ul><ul><ul><ul><li>limited customization to individual users </li></ul></ul></ul>
    4. 4. Data Mining Techniques – Association Rules <ul><li>A useful mechanism for discovering correlations among items belonging to customer transactions in a market basket database. </li></ul><ul><li>Rule form: “Body ⇒ Head [support, confidence]”. </li></ul><ul><li>Find all the rules X & Y ⇒ Z whose support and confidence meet the minimum thresholds </li></ul><ul><ul><li>support, s: the probability that a transaction contains {X ∪ Y ∪ Z} </li></ul></ul><ul><ul><li>confidence, c: the conditional probability that a transaction containing {X ∪ Y} also contains Z </li></ul></ul>
    5. 5. Association Rules example <ul><li>For rule A ⇒ C (min. support 50%, min. confidence 50%): </li></ul><ul><ul><li>support = support({A ∪ C}) = 50% </li></ul></ul><ul><ul><li>confidence = support({A ∪ C}) / support({A}) = 66.6% </li></ul></ul>
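The transaction table behind this example did not survive in the transcript. As a minimal sketch, the following assumes a hypothetical four-transaction database (not taken from the slides) chosen so that the rule A ⇒ C gives the same support and confidence as quoted above. The helper names `support` and `confidence` are illustrative.

```python
from fractions import Fraction

# Hypothetical market-basket database (an assumption, not from the slides),
# chosen so that the rule A => C reproduces the quoted 50% and 2/3 figures.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in db if itemset <= t)
    return Fraction(hits, len(db))

def confidence(body, head, db):
    """support(body U head) / support(body) for the rule body => head."""
    return support(set(body) | set(head), db) / support(body, db)

rule_support = support({"A", "C"}, transactions)
rule_confidence = confidence({"A"}, {"C"}, transactions)
print(f"support(A => C)    = {float(rule_support):.1%}")
print(f"confidence(A => C) = {float(rule_confidence):.1%}")
```

Under these assumptions the confidence is exactly 2/3, which the slide writes as 66.6%. Both values clear the 50% thresholds, so the rule would be reported.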
    6. 6. Association Rules – Apriori Algorithm <ul><li>The Apriori algorithm is the most popular algorithm for computing association rules. </li></ul><ul><li>The Apriori principle: </li></ul><ul><ul><li>Any subset of a frequent itemset must be frequent </li></ul></ul><ul><li>Pseudo-code: </li></ul><ul><ul><ul><li>C_k: candidate itemsets of size k </li></ul></ul></ul><ul><ul><ul><li>L_k: frequent itemsets of size k </li></ul></ul></ul><ul><ul><ul><li>L_1 = {frequent items}; </li></ul></ul></ul><ul><ul><ul><li>for (k = 1; L_k != ∅; k++) do begin </li></ul></ul></ul><ul><ul><ul><li>C_{k+1} = candidates generated from L_k; </li></ul></ul></ul><ul><ul><ul><li>for each transaction t in database do </li></ul></ul></ul><ul><ul><ul><ul><li>increment the count of all candidates in C_{k+1} that are contained in t </li></ul></ul></ul></ul><ul><ul><ul><li>L_{k+1} = candidates in C_{k+1} with min_support </li></ul></ul></ul><ul><ul><ul><li>end </li></ul></ul></ul><ul><ul><ul><li>return ∪_k L_k; </li></ul></ul></ul>
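    7. 7. Apriori Algorithm Example [Figure: Database D is scanned to count the candidates C_1 and obtain L_1; C_2 is generated from L_1 and D is scanned again to obtain L_2; C_3 is generated from L_2 and a third scan of D yields L_3.]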
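The Apriori pseudo-code on slide 6 can be sketched in Python roughly as follows. This is an illustrative sketch rather than the authors' implementation. The candidate-generation step (joining pairs of frequent k-itemsets and pruning by the Apriori principle) is one common way to realize "candidates generated from L_k". The database `db` is hypothetical and does not come from the slide figure.

```python
from itertools import combinations

def apriori(db, min_support):
    """Return {itemset: count} for every itemset found in at least
    min_support * len(db) transactions. db is a list of sets."""
    min_count = min_support * len(db)

    # L_1: count single items and keep the frequent ones.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)

    k = 1
    while level:
        # C_{k+1}: join pairs of frequent k-itemsets, then prune any candidate
        # that has an infrequent k-subset (the Apriori principle).
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                candidates.add(union)

        # One scan of the database to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # L_{k+1}: candidates that reach the minimum support.
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical database (an assumption, not the one in the slide figure).
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
result = apriori(db, min_support=0.5)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The loop stops as soon as a level comes back empty, which mirrors the `L_k != ∅` condition in the pseudo-code.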
    8. 8. Data Mining Techniques – Classification <ul><li>The goal is to induce a model or description for each class in terms of the attributes. </li></ul><ul><li>Classifiers are useful in the Web context to build taxonomies and topic hierarchies on Web pages. </li></ul><ul><li>Two-step process: </li></ul><ul><ul><li>Model construction </li></ul></ul><ul><ul><li>Using the model for prediction </li></ul></ul>
    9. 9. Classification Process (1): Model Construction [Figure: Training Data → Classification Algorithms → Classifier (Model). Example of a learned rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’]
    10. 10. Classification Process (2): Use the Model in Prediction [Figure: the Classifier is checked against Testing Data and then applied to Unseen Data, e.g. (Jeff, Professor, 4) → Tenured?]
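The two steps can be illustrated with a small sketch. It takes the rule shown on the model-construction slide as the already-built classifier, estimates its accuracy on a few labelled test tuples, and then applies it to the unseen tuple (Jeff, Professor, 4). The test tuples are hypothetical, since the table on the original slide did not survive in the transcript, and the function name `tenured` is illustrative.

```python
def tenured(rank, years):
    """Classifier from the slide: IF rank = 'professor' OR years > 6 THEN 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Hypothetical labelled test data (name, rank, years, actual label).
# These rows are an assumption made for illustration only.
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Step 2a: estimate accuracy on the test data.
correct = sum(tenured(rank, years) == label for _, rank, years, label in test_data)
accuracy = correct / len(test_data)

# Step 2b: classify the unseen tuple from the slide.
prediction = tenured("Professor", 4)

print(f"accuracy on test data: {accuracy:.0%}")
print(f"(Jeff, Professor, 4) -> tenured = {prediction}")
```

With these assumed rows the rule misclassifies one tuple out of four, which shows why the model is checked on test data before it is trusted on unseen data.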
    11. 11. Classification <ul><li>Supervised learning </li></ul><ul><ul><li>The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations </li></ul></ul><ul><ul><li>New data is classified based on the training set </li></ul></ul><ul><li>Decision tree classifiers are popular since they are easily interpreted by humans and are efficient to build. Construction has two phases: </li></ul><ul><ul><li>Building phase </li></ul></ul><ul><ul><li>Pruning phase </li></ul></ul>
    12. 12. Data Mining Techniques – Clustering <ul><li>Clustering is a useful technique for discovering interesting data distributions and patterns in the underlying data. </li></ul><ul><li>A cluster is a collection of data objects that are </li></ul><ul><ul><li>similar to one another within the same cluster (high intra-cluster similarity) </li></ul></ul><ul><ul><li>dissimilar to the objects in other clusters (low inter-cluster similarity) </li></ul></ul><ul><li>Clustering is unsupervised classification: there are no predefined classes </li></ul><ul><li>Common methods </li></ul><ul><ul><li>Partitioning methods </li></ul></ul><ul><ul><li>Hierarchical methods </li></ul></ul><ul><ul><li>Density-based methods </li></ul></ul><ul><ul><li>Grid-based methods </li></ul></ul><ul><ul><li>Model-based methods </li></ul></ul>
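The slide names partitioning methods without singling out an algorithm. As one representative example, chosen here for illustration and not prescribed by the slides, the following is a minimal k-means sketch. The points and the initial centroids are hypothetical.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: alternate between assigning each point to its nearest
    centroid and moving each centroid to the mean of its assigned points."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute centroids; an empty cluster keeps its old one.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D data with two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

No class labels are supplied anywhere, which is the sense in which clustering is unsupervised. The initial centroids are passed in explicitly so the run is deterministic.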
    13. 13. Web Mining Techniques <ul><li>Hubs and Authorities </li></ul><ul><ul><li>J. Kleinberg, 1999 </li></ul></ul><ul><ul><li>Aims to discover the underlying Web structure by analyzing the link topology. </li></ul></ul><ul><ul><li>Authorities are highly referenced pages on the topic. </li></ul></ul><ul><ul><li>Hubs are pages that “point” to many of the authorities. </li></ul></ul><ul><ul><li>Hubs and authorities thus exhibit a strong mutually reinforcing relationship. </li></ul></ul><ul><li>Building Web Knowledge Base </li></ul><ul><ul><li>By enumerating and organizing all Web occurrences of chosen subgraphs. </li></ul></ul><ul><li>Mining the Structure of Web Documents </li></ul><ul><ul><li>XML </li></ul></ul>
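The mutually reinforcing relationship between hubs and authorities can be sketched as an iterative computation: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The code below is a minimal sketch of that idea on a hypothetical five-page link graph. The page names, the iteration count and the function name `hits` are assumptions made for illustration.

```python
import math

def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to.
    Returns (hub, authority) dictionaries of normalized scores."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages pointing to a page.
        auth = {p: 0.0 for p in pages}
        for p, targets in graph.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: sum the authority scores of all pages a page points to.
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # Normalize both score vectors to unit length so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical link graph: three hub-like pages pointing at two target pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(graph)
print("best authority:", max(auth, key=auth.get))
print("hub scores:", {p: round(s, 3) for p, s in sorted(hub.items())})
```

In this toy graph the page linked from all three hubs ends up as the top authority, and the pages that link to both targets score higher as hubs than the page that links to only one.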
    14. 14. Web Mining Research Issues <ul><ul><li>Mining Web Structure </li></ul></ul><ul><ul><ul><li>Current approaches only take hyperlink information into account and pay little or no attention to the content of Web pages. </li></ul></ul></ul><ul><ul><li>Improving Customization </li></ul></ul><ul><ul><ul><li>Providing users with pages, sites and advertisements that are of interest to them. </li></ul></ul></ul><ul><ul><ul><li>Automatically optimizing the design and organization of Web sites based on observed user patterns. </li></ul></ul></ul><ul><ul><li>Extracting Information from Hypertext Documents </li></ul></ul><ul><ul><ul><li>Complicated, because HTML provides very little semantic information. </li></ul></ul></ul><ul><ul><ul><li>XML may make it possible to transform the entire Web into one unified database. </li></ul></ul></ul>
    15. 15. Reference <ul><li>Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, slides for the textbook. Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada. </li></ul>
    16. 16. Q & A Thanks! ^_^