Building an Intelligent Web: Theory & Practice
Published in: Education, Technology
  1. 1. Building an Intelligent Web: Theory and Practice. Pawan Lingras, Saint Mary’s University; Rajendra Akerkar, American University of Armenia and SIBER, India
  2. 2. [Chart: chapter coverage by discipline (Mathematics and Statistics, Management, Computer Science) for Research, Graduate, Information Retrieval and Web Mining use. Coverage options shown: Complete Book; Chapters 1 – 8 excluding shaded portion related to mathematics and implementation; Chapters 1 – 8 excluding shaded portion related to implementation; Chapters 2, 4 – 8 excluding shaded portion related to implementation; Chapters 1, 2, 3, 7 and 8; Chapters 4 – 8.]
  3. 3. Information Retrieval
  4. 4. Create a list of words; remove stop words; stem words; calculate frequency of each stemmed word. Figure 2.1 Transforming text document to a weighted list of keywords
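
The four steps in Figure 2.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the book's implementation: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude stand-in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "of", "the", "to"}


def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keyword_frequencies(text):
    """Turn a text document into a frequency-weighted list of stemmed keywords."""
    words = re.findall(r"[a-z]+", text.lower())          # 1. create a list of words
    kept = [w for w in words if w not in STOP_WORDS]     # 2. remove stop words
    stems = [naive_stem(w) for w in kept]                # 3. stem words
    return Counter(stems)                                # 4. count each stemmed word


freqs = keyword_frequencies("Data mining is the mining of data in databases")
print(freqs.most_common())
```

Raw counts are the simplest possible weight; other weighting schemes can be substituted in step 4 without changing the first three steps.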
  5. 5. Data Mining has emerged as one of the most exciting and dynamic fields in computing science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, within the span of a few years, the software market itself for data mining is expected to be in excess of $10 billion. Data mining refers to a family of techniques used to detect interesting nuggets of relationships/knowledge in data. While the theoretical underpinnings of the field have been around for quite some time (in the form of pattern recognition, statistics, data analysis and machine learning), the practice and use of these techniques have been largely ad-hoc. With the availability of large databases to store, manage and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence and algorithms that efficiently analyze data. The distributed nature of several databases, their size and the high complexity of many techniques present interesting computational challenges.
  6. 6. Figure 2.43 Relationship between precision and recall (plot of precision against recall, both axes running from 0 to 1)
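
For reference, precision and recall can be computed from a set of retrieved documents and a set of relevant documents as sketched below. The document identifiers are hypothetical, and the function follows the standard set-based definitions rather than any code from the book.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 3 documents actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Retrieving more documents can only keep recall the same or raise it, while precision usually falls, which is the trade-off the figure plots.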
  7. 7. Semantic Web
  8. 8. Semantic Web: The layer language model (Berners-Lee, 2001; Broekstra et al., 2001)
  9. 9. <h1>Student Service Centre</h1>
      Welcome to the home page of the Student Service Centre.
      The centre is located in the main building of the University.
      You may visit us for assistance during working days.
      <h2>Office hours</h2>
      Mon to Thu 8am - 6pm<br>
      Fri 8am - 2pm
      <p>But note that centre is not open during the weeks of the <a href=”. . .”>State Of Origin</a>.
      Figure 3.2 Example of a Web page of a Student Service Centre
  10. 10. <organization>
        <serviceOffered>Admission</serviceOffered>
        <organizationName>Student Service Centre</organizationName>
        <staff>
          <director>John Roth</director>
          <secretary>Penny Brenner</secretary>
        </staff>
      </organization>
      Figure 3.3 Example of a Web page of a Student Service Centre
  11. 11. Figure 3.4 Representing classes and instances (Noy et al., 2001)
  12. 12. root
        college
          lecturer: @name Edward Bunker
            course: @title Algorithms
            course: @title Computational Algebra
          lecturer: @name Daniela Frost
            course: @title Nonlinear Analysis
          lecturer: @name Sam Hoofer
            course: @title Discrete Structures
            course: @title Modern Algebra
            course: @title Nonlinear Analysis
          location: Innsbruck
  13. 13. Queries 1 and 2
      root
        college
          lecturer: @name Edward Bunker
            course: @title Algorithms
            course: @title Computational Algebra
          lecturer: @name Daniela Frost
            course: @title Nonlinear Analysis
          lecturer: @name Sam Hoofer
            course: @title Discrete Structures
            course: @title Modern Algebra
            course: @title Nonlinear Analysis
          location: Innsbruck
  14. 14. Queries 3 and 4
      root
        college
          lecturer: @name Edward Bunker
            course: @title Algorithms
            course: @title Computational Algebra
          lecturer: @name Daniela Frost
            course: @title Nonlinear Analysis
          lecturer: @name Sam Hoofer
            course: @title Discrete Structures
            course: @title Modern Algebra
            course: @title Nonlinear Analysis
          location: Innsbruck
  15. 15. <?xml version="1.0"?>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
        <rdf:Description rdf:about="">
          <dc:title> Building an Intelligent Web: Theory and Practice </dc:title>
          <dc:creator> Rajendra Akerkar and Pawan Lingras </dc:creator>
        </rdf:Description>
      </rdf:RDF>
      Figure 3.26 Fragment of RDF
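
As a rough illustration of how a program can consume the RDF fragment in Figure 3.26, the sketch below reads the Dublin Core title and creator with Python's standard `xml.etree.ElementTree` module. It treats the fragment as plain namespaced XML, which is an assumption made for brevity; a dedicated RDF library would interpret it as triples instead.

```python
import xml.etree.ElementTree as ET

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:title> Building an Intelligent Web: Theory and Practice </dc:title>
    <dc:creator> Rajendra Akerkar and Pawan Lingras </dc:creator>
  </rdf:Description>
</rdf:RDF>"""

# Prefix-to-URI map used only for the find() calls below.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

root = ET.fromstring(RDF_XML)
description = root.find("rdf:Description", NS)
title = description.find("dc:title", NS).text.strip()
creator = description.find("dc:creator", NS).text.strip()
print(title, "|", creator)
```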
  16. 16. An RDF model for automobiles
  17. 17. <?xml version="1.0"?>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
               xmlns:my="http://www.myvehicle.com/vehicle-schema/">
        <rdfs:Class rdf:about="#Vehicle"/>
        <rdfs:Class rdf:about="#Car">
          <rdfs:subClassOf rdf:resource="#Vehicle"/>
        </rdfs:Class>
        <rdf:Property rdf:about="#name">
          <rdfs:domain rdf:resource="#Vehicle"/>
        </rdf:Property>
        <rdf:Description rdf:about="#Ford">
          <rdf:type rdf:resource="#Car"/>
          <my:name>Ford Icon</my:name>
        </rdf:Description>
        <my:Truck rdf:about="#Mitsubishi">
          <my:name>Mitsubishi</my:name>
          <my:carry rdf:resource="#Mitsubishi"/>
        </my:Truck>
      </rdf:RDF>
      Figure 3.29 RDF/XML file for the automobile example
  18. 18. <?xml version="1.0"?>
      <topicMap id="tmrf"
                xmlns="http://www.topicmaps.org/xtm/1.0/"
                xmlns:xlink="http://www.w3.org/1999/xlink">
        <!-- The map contains information about Technomathematics Research Foundation. We can include comment and narrative here… -->
        .... here my topics and my associations go ...
      </topicMap>
      Figure 3.30 A Topic Map document (Adopted from http://topicmaps.bond.edu.au/docs/6/1)
  19. 19. Classification and Association
  20. 20. Data Preparation
      • Database Theory
      • SQL
      • Data Transformation
      • http://www.ecn.purdue.edu/KDDCUP/data/
  21. 21. Classification
      • Find a rule, a formula, or black box classifier for organizing data into classes.
        – Classify clients requesting loans into categories based on the likelihood of repayment
        – Classify customers into Big or Moderate Spenders based on what they buy
        – Classify the customers into loyal, semi-loyal, infrequent based on the products they buy
      • The classifier is developed from the data in the training set
      • The reliability of the classifier is evaluated using the test set of data
  22. 22. Classification
      • ID3 Algorithm
        – Numerical Illustration
        – Application to a Small E-commerce Dataset
      • C4.5 for Experimentation
      • Other approaches
        – Neural Networks
        – Fuzzy Classification
        – Rough Set Theory
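
ID3 grows a decision tree by repeatedly splitting on the attribute with the highest information gain. The sketch below shows only that attribute-selection step. The four customer records and the attribute names (`visits`, `region`, `spender`) are hypothetical and are not taken from the book's e-commerce dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` obtained by splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """The attribute ID3 would choose for the next split."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical training set: classify customers as big or moderate spenders.
rows = [
    {"visits": "high", "region": "east", "spender": "big"},
    {"visits": "high", "region": "west", "spender": "big"},
    {"visits": "low", "region": "east", "spender": "moderate"},
    {"visits": "low", "region": "west", "spender": "moderate"},
]

split = best_attribute(rows, ["visits", "region"], "spender")
print(split, information_gain(rows, "visits", "spender"))
```

In this toy data `visits` separates the classes perfectly while `region` carries no information, so the first split falls on `visits`.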
  23. 23. Association
      • Market basket analysis
        – determine which things go together
      • Transactions might reveal that
        – customers who buy bananas also buy candles
        – cheese and pickled onions seem to occur frequently in a shopping cart
      • Information can be used for
        – arranging a physical shop or structuring the Web site
        – a targeted advertising campaign
  24. 24. Association
      • Apriori Algorithm
      • Demonstration for an E-commerce Application
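
The core of the Apriori algorithm is a level-wise search: frequent itemsets of size k are built only from frequent itemsets of size k-1. The following is a compact sketch of that idea, not the book's implementation. The shopping baskets are hypothetical, and `min_support` is an absolute count rather than a percentage.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset seen in at least
    `min_support` transactions."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical shopping baskets.
baskets = [
    {"banana", "candles", "cheese"},
    {"banana", "candles"},
    {"cheese", "onions"},
    {"banana", "candles", "onions"},
]

frequent = apriori(baskets, min_support=2)
pair = frozenset({"banana", "candles"})
# Confidence of the rule banana -> candles.
confidence = frequent[pair] / frequent[frozenset({"banana"})]
print(frequent[pair], confidence)
```

Association rules are then read off the frequent itemsets by dividing support counts, as the `confidence` line does for one rule.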
  25. 25. Clustering
  26. 26. Clustering
      • Breaks a large database into different subgroups or clusters
      • Unlike classification there are no predefined classes
      • The clusters are put together on the basis of similarity to each other
      • The data miners determine whether the clusters offer any useful insight
  27. 27. [Plot with both axes running from 0 to 5]
  28. 28. Statistical Methods
      • k-means
        – Numerical Example
        – Implementation
          • Data Preparation
          • Clustering
      • Other Methods
  29. 29. Neural Network Based Approaches
      • Kohonen Self Organising Maps
        – Numerical Demonstration
        – Application to Web Data Collection
      • Other Neural Network Based Approaches
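
A bare-bones version of k-means is sketched below. The six two-dimensional points and the two starting centroids are hypothetical values chosen to fall within a 0 to 5 range; they are not the book's numerical example. Starting centroids are passed in explicitly so the run is deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of numbers, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:     # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visually separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (4, 4), (4, 5), (5, 4)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (5, 4)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, so practical implementations usually run the algorithm several times from different starting points.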
  30. 30. Clustering of customers
  31. 31. Web Mining
      • Web Content Mining
        – Web Page Content Mining
        – Search Result Mining
      • Web Structure Mining
      • Web Usage Mining
        – General Access Pattern Tracking
        – Customized Usage Tracking
  32. 32. Web Usage Mining
  33. 33. High level web usage mining process (Srivastava et al., 2000)
  34. 34. Applications of web usage mining (Romanko, 2006; Srivastava et al., 2000)
  35. 35. 140.14.6.11 - pawan [06/Sep/2001:10:46:07 -0300] "GET /s.htm HTTP/1.0" 200 2267
      140.14.7.18 - raj [06/Sep/2001:11:23:53 -0300] "POST /s.cgi HTTP/1.0" 200 499
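
Web usage mining typically starts by parsing log entries like the two above into structured records. The sketch below assumes the entries follow the Common Log Format, which is what the sample lines look like. The group names in the regular expression are labels chosen here for illustration, and the timestamp parsing assumes English month abbreviations in the current locale.

```python
import re
from datetime import datetime

# One Common Log Format entry: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Parse one log entry into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A size of "-" means no bytes were returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


lines = [
    '140.14.6.11 - pawan [06/Sep/2001:10:46:07 -0300] "GET /s.htm HTTP/1.0" 200 2267',
    '140.14.7.18 - raj [06/Sep/2001:11:23:53 -0300] "POST /s.cgi HTTP/1.0" 200 499',
]
entries = [parse_log_line(line) for line in lines]
for e in entries:
    print(e["user"], e["method"], e["path"], e["status"], e["size"])
```

Once entries are in this form, grouping them by user or host and ordering them by time gives the sessions that clustering, classification and association mining work on.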
  36. 36. Clustering exercise
  37. 37. Classification exercise

      | Channel  | Recall | Precision |
      |----------|--------|-----------|
      | Finance  | 44.3%  | 98.27%    |
      | Health   | 52.3%  | 89.66%    |
      | Market   | 49.1%  | 83.34%    |
      | News     | 44.1%  | 89.27%    |
      | Shopping | 31.5%  | 91.31%    |
      | Specials | 60.2%  | 92.86%    |
      | Sport    | 50.0%  | 91.93%    |
      | Surveys  | 21.9%  | 92.66%    |
      | Theatre  | 54.8%  | 94.63%    |

      Table 6.8 Precision and recall for predicting user’s interest in channels (Baglioni, et al., 2003)
  38. 38. Association exercise

      | News Section  | Minimum Requests | Maximum Requests | Mean Requests | Standard Deviation |
      |---------------|------------------|------------------|---------------|--------------------|
      | Science       | 1                | 97               | 2.3034        | 2.8184             |
      | Culture       | 1                | 208              | 3.7878        | 5.9742             |
      | Sports        | 1                | 318              | 5.6985        | 10.8360            |
      | Economics     | 1                | 258              | 3.9335        | 7.2341             |
      | International | 1                | 208              | 3.3823        | 5.5540             |
      | Local Lisbon  | 1                | 460              | 5.6883        | 11.5650            |
      | Local Port    | 1                | 256              | 7.5984        | 13.2351            |
      | Politics      | 1                | 208              | 3.3577        | 5.4101             |
      | Society       | 1                | 367              | 4.2673        | 7.9853             |
      | Education     | 1                | 90               | 2.6496        | 3.29090            |

      Table 6.9 Summary statistics of requests to the Publico on-line newspaper (Batista and Silva, 2002)
  39. 39. The association mining showed strong associations between the following pairs:
      • Politics and Society
      • Politics and International News
      • Politics and Sports
      • Society and International News
      • Society and Local Lisbon
      • Society and Sports
      • Society and Culture
      • Sports and International News
  40. 40. Sequence Pattern Analysis of Web Logs
  41. 41. Web Content Mining
  42. 42. Data Collection
      • Web Crawlers
      • Public Domain Web Crawlers
      • An Implementation of a Web Crawler
  43. 43. Architecture of a search engine (Romanko, 2006)
  44. 44. Other topics in Web Content Mining
      • Search Engines
        – How to prepare for and set up a search engine
        – Types and listings of search engines (freeware, remote hosting services, commercial)
      • Multimedia Information Retrieval
  45. 45. Web Structure Mining
  46. 46. 0/10: The site or page is probably new.
      3/10: The site is perhaps new, small in size and has very little or no worthwhile arriving links. The page gets very little traffic.
      5/10: The site has a fair amount of worthwhile arriving links and traffic volume. The site might be larger in size and gets a good amount of steady traffic with some return visitors.
      8/10: The site has many arriving links, probably from other high PageRank pages. The site perhaps contains a lot of information and has a higher traffic flow and return visitor rate.
      10/10: The Web site is large, popular and has an extremely high number of links pointing to it.
  47. 47. http://www.iprcom.com/papers/pagerank/
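
The 0 to 10 scale above is a coarse summary of an underlying link-analysis score. As a rough illustration of how such a score can be computed, the sketch below runs the textbook power-iteration form of PageRank on a small link graph. The four-page graph is hypothetical, the damping factor of 0.85 is the commonly quoted default rather than a value from the slides, and the mapping from raw scores to the 0 to 10 scale is not modelled.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    `links` maps each page to the list of pages it links to; every link
    target must also appear as a key. Returned ranks sum to 1.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to C, and C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
print(top_page, ranks)
```

Page C comes out on top because it has the most arriving links, and A follows closely because it is linked from the highly ranked C; this matches the description of high-scoring sites as having many arriving links from other high PageRank pages.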
  48. 48. Index quality for different search engines (Henzinger, et al., 1999)
  49. 49. Index quality per page for different search engines (Henzinger, et al., 1999)
  50. 50. 
      | Page                                          | Freq. Walk2 | Freq. Walk1 | Rank Walk1 |
      |-----------------------------------------------|-------------|-------------|------------|
      | www.microsoft.com/                            | 3172        | 1600        | 1          |
      | www.microsoft.com/windows/ie/default.htm      | 2064        | 1045        | 3          |
      | www.netscape.com/                             | 1991        | 876         | 6          |
      | www.microsoft.com/ie/                         | 1982        | 1017        | 4          |
      | www.microsoft.com/windows/ie/download/        | 1915        | 943         | 5          |
      | www.microsoft.com/windows/ie/download/all.htm | 1696        | 830         | 7          |
      | www.adobe.com/prodindex/acrobat/readstep.html | 1634        | 780         | 8          |
      | home.netscape.com/                            | 1581        | 695         | 10         |
      | www.linkexchange.com/                         | 1574        | 763         | 9          |
      | www.yahoo.com/                                | 1527        | 1132        | 2          |

      Table 8.2 Most frequently visited pages (Henzinger, et al., 1999)
  51. 51. 
      | Site                 | Frequency Walk 2 | Frequency Walk 1 | Rank Walk 1 |
      |----------------------|------------------|------------------|-------------|
      | www.microsoft.com    | 32452            | 16917            | 1           |
      | home.netscape.com    | 23329            | 11084            | 2           |
      | www.adobe.com        | 10884            | 5539             | 3           |
      | www.amazon.com       | 10146            | 5182             | 4           |
      | www.netscape.com     | 4862             | 2307             | 10          |
      | excite.netscape.com  | 4714             | 2372             | 9           |
      | www.real.com         | 4494             | 2777             | 5           |
      | www.lycos.com        | 4448             | 2645             | 6           |
      | www.zdnet.com        | 4038             | 2562             | 8           |
      | www.linkexchange.com | 3738             | 1940             | 12          |
      | www.yahoo.com        | 3461             | 2595             | 7           |

      Table 8.3 Most frequently visited hosts (Henzinger, et al., 1999)