Identification of Relevant Sections in Web Pages Using a Machine Learning Approach


A brief introduction to machine learning, supervised and unsupervised learning, and Support Vector Machines (SVMs).
Application of a supervised algorithm, an SVM, to identify the relevant sections of web pages returned in search results.

Published in: Technology


  1. Identification of Relevant Sections in Web Pages Using a Machine Learning Approach. Jerrin Shaji George, NIT Calicut, November 8, 2012.
  2. Introduction: There is a massive amount of data available on the internet. Extracting only the relevant content has become very important. A machine learning approach is suitable because it can adapt to the rapidly changing dynamics of the internet.
  3. Machine Learning: The science of getting computers to act without being explicitly programmed. A method of teaching computers to make and improve predictions or behaviors based on some data. Types of machine learning algorithms: supervised machine learning and unsupervised machine learning.
  4. Supervised Learning: The machine learning task of inferring a function from labeled training data. Figure: Supervised Learning Model (courtesy scikit-learn).
  5. Supervised Learning: Example of a classification problem (discrete-valued output). Figure: Copyright © Victor Lavrenko.
  6. Supervised Learning: Example of a regression problem (continuous-valued output). Figure: Copyright © Victor Lavrenko.
  7. Unsupervised Learning: The data has no labels. The algorithm tries to find similarities between the objects in question. Figure: Unsupervised Learning Model (courtesy scikit-learn).
  8. Unsupervised Learning: Example of a clustering problem. Figure: Copyright © Victor Lavrenko.
  9. Support Vector Machines (SVM): A supervised learning model used for classification and regression analysis. The basic SVM is a non-probabilistic binary linear classifier: it assigns each given input to one of two possible classes, and that class is the output.
  10. The SVM Algorithm: Inputs are formulated as feature vectors. The feature vectors are mapped into a feature space by using a kernel function. A division is computed in the feature space to optimally separate the classes of training vectors.
  11. The SVM Algorithm: φ, the mapping into feature space associated with the kernel function.
  12. Formal Definition of SVM: An SVM constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space. It can be used for classification and regression. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (called the functional margin).
  13. Optimal Separating Hyperplane. Figure: Courtesy Steve Gunn.
  14. Functional Margin: The vectors (points) that constrain the width of the margin are the support vectors. Figure: Image from scikit-learn.
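As a rough illustration of the margin and support vectors described on this slide, the sketch below measures how far a few toy points lie from a fixed hyperplane. The hyperplane parameters `w` and `b` and the points are made-up values for illustration, not taken from the slides, and no SVM training happens here.

```python
import math

# Hypothetical separating hyperplane w.x + b = 0 in 2D (illustrative values).
w = (1.0, 1.0)
b = -3.0

# Toy labeled points: (x1, x2, label) with label in {-1, +1}.
points = [(0.0, 1.0, -1), (1.0, 1.0, -1), (2.0, 2.0, +1), (3.0, 3.0, +1)]

def signed_distance(x1, x2):
    """Signed geometric distance of a point from the hyperplane."""
    norm = math.hypot(*w)
    return (w[0] * x1 + w[1] * x2 + b) / norm

# Distance of each point on its own side (positive when correctly classified).
margins = [label * signed_distance(x1, x2) for x1, x2, label in points]

# The margin is set by the closest points; those points are the support vectors.
margin = min(margins)
support_vectors = [p for p, m in zip(points, margins) if math.isclose(m, margin)]

print(margin)
print(support_vectors)
```

Only the two closest points determine the margin here; moving the other two farther away would leave it unchanged.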
  15. Mapping to Higher Dimensions: Sometimes data is not linearly separable. If the original finite-dimensional space is mapped into a much higher-dimensional space, the separation is made easier in that space. The SVM achieves this using the kernel trick.
  16. Mapping to Higher Dimensions: Mapping from 1D to 2D; mapping from 2D to 3D. Figure: Courtesy Steve Gunn.
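The 1D-to-2D case can be sketched in a few lines. The data and the feature map `phi(x) = (x, x^2)` below are assumptions chosen for illustration, not the mapping shown in the slide's figure. The sketch applies the map explicitly, whereas the kernel trick would avoid computing it.

```python
# 1D points: the positive class sits between the negatives,
# so no single threshold on x separates the two classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [-1, -1, +1, +1, +1, -1, -1]

def separable_by_threshold(values, labels):
    """True if one threshold puts all of one class below it and the other above."""
    pos = [v for v, y in zip(values, labels) if y == +1]
    neg = [v for v, y in zip(values, labels) if y == -1]
    return max(pos) < min(neg) or max(neg) < min(pos)

def phi(x):
    """Illustrative feature map from 1D to 2D: x -> (x, x^2)."""
    return (x, x * x)

mapped = [phi(x) for x in xs]

# In 2D a horizontal line separates the classes, so checking the
# second coordinate alone is enough for this toy data.
second_coord = [p[1] for p in mapped]

before = separable_by_threshold(xs, labels)
after = separable_by_threshold(second_coord, labels)
print(before, after)
```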
  17. Identification of Relevant Sections in a Web Page for Web Search: Shallow techniques like keyword matching give unsatisfactory results. Search methodologies must focus more on contextual information than on keyword occurrences alone. The search term might not be a very differentiating term, and it might not appear in the relevant section at all. SQUINT: an SVM-based approach to identify sections of a Web page relevant to a Web search.
  18. Overall Architecture
  19. Feature Generation: Word Rank Based Features; Bigram Rank Based Features; Coverage of Top Ranked Tokens; Query Word Frequency; Distance from the Query.
  20. Word Rank Based Features: The rank of a word is its position in the list of words ordered by frequency of occurrence across all search results. The value of this feature is the frequency of the particular word in the given section. Bucketing can be used to reduce dimensionality.
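A minimal sketch of word rank features with bucketing follows. The toy sections, the bucket width `BUCKET_SIZE`, and the choice to sum word frequencies within a rank bucket are all assumptions; the slides do not say how SQUINT buckets ranks.

```python
from collections import Counter

# Hypothetical sections gathered from all search results (already tokenized).
sections = [
    ["svm", "kernel", "margin", "svm"],
    ["svm", "kernel", "recipe"],
    ["weather", "svm"],
]

# Rank words by frequency across all sections: rank 0 is the most frequent.
# Ties are broken alphabetically so the ordering is deterministic.
totals = Counter(word for section in sections for word in section)
ranked = [w for w, _ in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))]
rank = {word: i for i, word in enumerate(ranked)}

BUCKET_SIZE = 2  # assumed width: ranks 0-1 share bucket 0, ranks 2-3 bucket 1, ...
n_buckets = (len(ranked) + BUCKET_SIZE - 1) // BUCKET_SIZE

def word_rank_features(section):
    """One feature per bucket: how often words of those ranks occur in the section."""
    features = [0] * n_buckets
    for word in section:
        features[rank[word] // BUCKET_SIZE] += 1
    return features

for section in sections:
    print(word_rank_features(section))
```

Without bucketing there would be one feature per distinct word; with it, the vector length shrinks by a factor of roughly `BUCKET_SIZE`.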
  21. Bigram Rank Based Features: A bigram is defined to be two consecutive words occurring in a section. For example, the bigram "machine learning" may be more important than "machine" and "learning" separately. The value of the feature is calculated in the same way as for Word Rank Based Features.
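Extracting bigrams from a tokenized section can be sketched as below. The helper name `bigrams` and the sample tokens are illustrative. The resulting counts could then be ranked and bucketed the same way as single words.

```python
from collections import Counter

def bigrams(tokens):
    """Pairs of consecutive tokens in a section."""
    return list(zip(tokens, tokens[1:]))

section = ["machine", "learning", "beats", "machine", "learning"]
counts = Counter(bigrams(section))

print(counts.most_common(1))
```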
  22. Coverage of Top Ranked Tokens: Relevance may also be determined by the number of top ranked words which occur in the section. The value of this feature is the coverage of top ranked words per bucket.
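One plausible reading of "coverage per bucket" is the fraction of each bucket's words that appear in the section at least once. The sketch below assumes that reading, along with a made-up ranked word list and bucket width; the slides do not define the exact formula.

```python
# Hypothetical words ordered by rank (most frequent across all results first).
ranked = ["svm", "kernel", "margin", "recipe", "weather", "hyperplane"]
BUCKET_SIZE = 2  # assumed bucket width

buckets = [ranked[i:i + BUCKET_SIZE] for i in range(0, len(ranked), BUCKET_SIZE)]

def coverage_features(section):
    """Fraction of each bucket's words that appear at least once in the section."""
    present = set(section)
    return [sum(w in present for w in bucket) / len(bucket) for bucket in buckets]

print(coverage_features(["svm", "kernel", "margin", "svm"]))
```

Unlike the frequency features, repeated occurrences of a word do not raise coverage.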
  23. Distance from the Query: The intuition here is that the closer a section is to the query in the Web page, the more likely it is to be relevant. The value of this feature is the section-wise distance between the section in question and the nearest section which contains the query.
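The section-wise distance can be sketched as the difference in section index to the nearest section containing the query. The function name, the single-word query, and the `None` result for pages that never mention the query are assumptions made for this example.

```python
def distance_from_query(sections, query):
    """For each section, the number of sections to the nearest one containing the query.

    Returns None for every section if no section contains the query.
    """
    hits = [i for i, tokens in enumerate(sections) if query in tokens]
    if not hits:
        return [None] * len(sections)
    return [min(abs(i - h) for h in hits) for i in range(len(sections))]

# Hypothetical page split into four tokenized sections.
page = [["intro"], ["svm", "kernel"], ["margin"], ["footer"]]
print(distance_from_query(page, "svm"))
```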
  24. Query Word Frequency: The value of this feature is the frequency of the query word in the section. The value is normalized by the number of words in the section.
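A minimal sketch of the normalized query word frequency, assuming a single-word query and a tokenized section; returning 0.0 for an empty section is a choice made here, not something the slides specify.

```python
def query_word_frequency(section, query):
    """Occurrences of the query word divided by section length (0.0 if empty)."""
    if not section:
        return 0.0
    return section.count(query) / len(section)

print(query_word_frequency(["svm", "kernel", "svm", "margin"], "svm"))
```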
  25. Training Set Generation: Query Google to get a set of pages. Clean each page by removing scripts, pictures, links, etc. Break each page into sections. Label each section of every page.
  26. Learning Algorithm: A Support Vector Machine with a linear kernel is used. Given the relatively high dimensionality of the feature vector, an SVM is a reasonable choice. The predicted margin of each sample is used to get a non-binary metric of how relevant each section is.
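Using the predicted margin as a relevance score can be sketched as follows. For a linear SVM the margin of a sample is the decision value w.x + b; the weights, bias, and feature vectors below are invented for illustration and are not learned from data or taken from SQUINT.

```python
# Hypothetical learned parameters of a linear SVM: score(x) = w . x + b.
weights = [0.8, 0.3, -0.5]
bias = -0.2

def decision_value(features):
    """Signed decision value; larger means more confidently 'relevant'."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# Hypothetical feature vectors for three sections of one page.
sections = {"A": [1.0, 0.5, 0.0], "B": [0.2, 0.0, 1.0], "C": [0.5, 1.0, 0.2]}

scores = {name: decision_value(f) for name, f in sections.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(ranking)
```

A binary classifier would only report the sign of each score; keeping the value itself lets the sections of a page be ordered from most to least relevant.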
  27. Conclusion: Support Vector Machines are an attractive approach to data modelling. Evaluations suggest that using information retrieval inspired features and some basic hints from summarization gives respectable accuracy in detecting the most relevant section in a page. Thus SQUINT can have a large impact on the user’s overall search experience.
  28. References: (1) Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000. (2) Siddharth Jonathan J.B., Riku Inoue and Jyotika Prasad, SQUINT: SVM for Identification of Relevant Sections in Web Pages for Web Search. (3) Wikipedia articles on Machine Learning and Support Vector Machine. (4) Machine Learning Course on Coursera.