A brief introduction to machine learning, supervised and unsupervised learning, and Support Vector Machines (SVMs).
Application of a supervised, SVM-based algorithm to identify the relevant sections of Web pages returned in search results.
1. Identification of Relevant Sections in Web Pages Using a
Machine Learning Approach
Jerrin Shaji George
NIT Calicut
November 8, 2012
2. Introduction
There is a massive amount of data available on the internet.
Extracting only the relevant content has become very important.
A Machine Learning approach is suitable as it can adapt to the
rapidly changing dynamics of the internet.
3. Machine Learning
The science of getting computers to act without being explicitly
programmed.
A method of teaching computers to make and improve predictions
or behaviors based on some data.
Machine Learning Algorithms:
Supervised Machine Learning
Unsupervised Machine Learning
4. Supervised Learning
Machine learning task of inferring a function from labeled training
data.
Figure: Supervised Learning Model (courtesy scikit-learn)
5. Supervised Learning
Example of a classification problem - discrete valued output.
Figure: Copyright © Victor Lavrenko
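A minimal sketch of a classification problem, assuming a nearest-centroid classifier (chosen here only for brevity; the slides do not name a method at this point). The points, the labels "small" and "large", and the function names are hypothetical.

```python
def train_centroids(points, labels):
    """Compute the mean point (centroid) of each labeled class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}


def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    px, py = point
    return min(
        centroids,
        key=lambda label: (centroids[label][0] - px) ** 2
        + (centroids[label][1] - py) ** 2,
    )


# Hypothetical labeled training data: 2D points with a discrete class each.
points = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
labels = ["small", "small", "large", "large"]
centroids = train_centroids(points, labels)

print(predict(centroids, (2.0, 2.0)))
print(predict(centroids, (5.0, 6.0)))
```

The output is one of a fixed set of labels, which is what makes this a classification task rather than a regression task.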
6. Supervised Learning
Example of a regression problem - continuous valued output.
Figure: Copyright © Victor Lavrenko
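As a rough illustration of the regression case, the sketch below fits a straight line to a few labeled points by ordinary least squares. The data values and the name fit_line are made up for illustration; the slides do not prescribe a particular regression method.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: inputs with continuous-valued outputs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

# Predict the output for an input that was not in the training data.
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```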
7. Unsupervised Learning
The data has no labels. The algorithm tries to find similarities
between the objects in question.
Figure: Unsupervised Learning Model (courtesy scikit-learn)
8. Unsupervised Learning
Example of a clustering problem
Figure: Copyright © Victor Lavrenko
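A minimal sketch of clustering unlabeled data, assuming k-means (Lloyd's algorithm) on one-dimensional values. The slides do not name a clustering algorithm; the data, the starting centers and the function names are hypothetical.

```python
def assign(values, centers):
    """Index of the nearest center for each value."""
    return [
        min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        for v in values
    ]


def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to centers and re-averaging."""
    centers = list(centers)
    for _ in range(iterations):
        labels = assign(values, centers)
        for i in range(len(centers)):
            members = [v for v, l in zip(values, labels) if l == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = sum(members) / len(members)
    return centers, assign(values, centers)


# Unlabeled data: no classes are given, only the values themselves.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centers, cluster_ids = kmeans_1d(values, centers=[0.0, 5.0])
print(centers, cluster_ids)
```

The algorithm groups the values purely by similarity (here, distance on the number line); no labels are involved at any point.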
9. Support Vector Machines (SVM)
A supervised learning model.
Used for classification and regression analysis.
The basic SVM:
A non-probabilistic binary linear classifier.
Assigns each input to one of two possible classes; the predicted
class is the output.
10. The SVM Algorithm
Inputs are formulated as feature vectors.
The feature vectors are mapped into a feature space by using a
kernel function.
A division is computed in the feature space to optimally separate
the classes of training vectors.
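The steps above can be sketched as follows, under some simplifying assumptions: the feature space is the input space itself (a linear kernel), and the separating division is found by batch subgradient descent on the regularised hinge loss rather than by a dedicated SVM solver. The data, the hyperparameters (lam, lr, epochs) and the function names are hypothetical.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.01, epochs=3000):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by subgradient descent."""
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1:  # inside the margin or misclassified
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical feature vectors with labels +1 / -1.
samples = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(samples, labels)
print([classify(w, b, x) for x in samples])
```

A production system would normally rely on an established SVM library; this loop is only meant to show where the feature vectors, the separating hyperplane and the margin condition enter.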
12. Formal Definition of SVM
An SVM constructs a hyperplane or set of hyperplanes in a high-
or infinite-dimensional space.
It can be used for classification and regression.
A good separation is achieved by the hyperplane that has the
largest distance to the nearest training data point of any class
(called the functional margin).
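For reference, the usual textbook formulation of the linearly separable case is written out below. The notation (w, b, x_i, y_i) is an assumption of this note and does not appear on the slides. The nearest-point distance of 1/||w|| holds for the points at which the constraint is met with equality.

```latex
\begin{aligned}
&\text{Separating hyperplane:} && w \cdot x + b = 0 \\
&\text{Training constraints:} && y_i \,(w \cdot x_i + b) \ge 1, \quad y_i \in \{-1, +1\} \\
&\text{Distance of the nearest points to the hyperplane:} && \frac{1}{\lVert w \rVert} \\
&\text{Hard-margin problem:} && \min_{w,\,b} \ \tfrac{1}{2} \lVert w \rVert^{2}
  \ \text{ subject to the constraints above}
\end{aligned}
```

Minimising the norm of w therefore maximises the width 2/||w|| of the band between the two classes.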
14. Functional Margin
The vectors (points) that constrain the width of the margin are the
support vectors.
Figure: Image from scikit-learn
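A small sketch of how support vectors can be picked out once a hyperplane is known. The weights w and offset b below are hypothetical values assumed to come from training on the listed points; the support vectors are the points whose value y * (w . x + b) equals 1, up to rounding.

```python
import math

# Hypothetical trained hyperplane w . x + b = 0 for the data below.
w = (2.0 / 3.0, 2.0 / 3.0)
b = -5.0 / 3.0

samples = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
labels = [1, 1, 1, -1, -1, -1]


def margin_value(x, y):
    """y * (w . x + b): equals 1 on the margin, larger further away."""
    return y * (w[0] * x[0] + w[1] * x[1] + b)


# Points lying exactly on the margin constrain its width.
support_vectors = [
    x for x, y in zip(samples, labels)
    if abs(margin_value(x, y) - 1.0) < 1e-9
]

# Width of the band between the two classes.
margin_width = 2.0 / math.hypot(w[0], w[1])
print(support_vectors, margin_width)
```

Removing any of the other points would leave this hyperplane unchanged, which is why only the support vectors matter for the final classifier.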
15. Mapping to Higher Dimensions
Sometimes data is not linearly separable.
If the original finite-dimensional space is mapped into a much
higher-dimensional space, the separation is made easier in that
space.
This is achieved by the SVM using the Kernel Trick.
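A minimal sketch of the kernel trick, assuming the degree-2 polynomial kernel k(x, y) = (x . y)^2 on 2D inputs (one common choice, not one named on the slides). The kernel returns the same number as a dot product in a 3D feature space, without ever building the 3D vectors. The names phi and poly_kernel are hypothetical.

```python
import math


def phi(x):
    """Explicit map from 2D to 3D that corresponds to the kernel below."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed in the original 2D space."""
    return dot(x, y) ** 2


a = (1.0, 2.0)
c = (3.0, 4.0)

via_kernel = poly_kernel(a, c)        # works on the 2D inputs only
via_mapping = dot(phi(a), phi(c))     # builds the 3D vectors first
print(via_kernel, via_mapping)
```

Because an SVM only needs dot products between samples, replacing each dot product with a kernel evaluation gives the effect of the higher-dimensional mapping at the cost of the lower-dimensional computation.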
16. Mapping to Higher Dimensions
Mapping from 1D to 2D
Mapping from 2D to 3D
Figure: Courtesy Steve Gunn
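The 1D to 2D case can be illustrated with a small sketch. The data and the mapping x -> (x, x^2) are assumptions chosen for illustration: on the line, no single threshold separates the two classes, but after the mapping a threshold on the second coordinate does.

```python
def separable_by_threshold(values, labels):
    """True if a single threshold on the values splits the two classes."""
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == -1]
    return max(pos) < min(neg) or max(neg) < min(pos)


# 1D data: the +1 class sits on both sides of the -1 class.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

# Map each point into 2D as (x, x^2).
mapped = [(x, x * x) for x in xs]
second_coordinate = [point[1] for point in mapped]

in_1d = separable_by_threshold(xs, labels)
in_2d = separable_by_threshold(second_coordinate, labels)
print(in_1d, in_2d)
```

In the mapped space, any horizontal line between x^2 = 1 and x^2 = 4 separates the classes, which corresponds to a linear decision boundary in 2D.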
17. Identification of Relevant Sections in a Web Page for
Web Search
Shallow techniques like keyword matching give unsatisfactory
results.
Search methodologies must focus more on contextual information
than just keyword occurrences.
The search term might not be a very differentiating term.
It might not appear in the section at all.
SQUINT: an SVM-based approach to identify sections of a Web
page relevant to a Web search.
19. Feature Generation
Word Rank Based Features
Bigram Rank Based Features
Coverage of Top Ranked Tokens
Query Word Frequency
Distance from the Query
20. Word Rank Based Features
The rank of a word is its position in the list of all words ordered
by frequency of occurrence across all search results.
The value of this feature is the frequency of the particular word in
the given section.
Bucketing can be used to reduce dimensionality.
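A minimal sketch of word rank based features with bucketing. Several details are assumptions of this sketch and not taken from the slides: ranks start at 0, ties are broken alphabetically, each bucket covers bucket_size consecutive ranks, and a bucket's value is the summed in-section frequency of its words. Sections are represented as lists of tokens.

```python
from collections import Counter


def rank_words(all_sections):
    """Map each word to its rank (0 = most frequent across all sections)."""
    counts = Counter(word for section in all_sections for word in section)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered)}


def bucketed_rank_features(section, ranks, bucket_size, num_buckets):
    """Sum the in-section frequency of the words falling in each rank bucket."""
    features = [0] * num_buckets
    for word in section:
        rank = ranks.get(word)
        if rank is not None and rank // bucket_size < num_buckets:
            features[rank // bucket_size] += 1
    return features


# Hypothetical tokenised sections gathered from all search results.
sections = [
    ["svm", "kernel", "svm"],
    ["svm", "margin", "kernel"],
    ["web", "search"],
]
ranks = rank_words(sections)
first = bucketed_rank_features(sections[0], ranks, bucket_size=2, num_buckets=3)
last = bucketed_rank_features(sections[2], ranks, bucket_size=2, num_buckets=3)
print(ranks, first, last)
```

Without bucketing there would be one feature per distinct word; grouping ranks into buckets keeps the feature vector at a fixed, much smaller length.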
21. Bigram Rank Based Features
A bigram is defined to be two consecutive words occurring in a
section.
E.g., the bigram "machine learning" may be more important than
"machine" and "learning" separately.
The value of the feature is calculated in the same way as for Word
Rank Based Features.
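A short sketch of bigram extraction, assuming a section is already tokenised into a list of words. The example sentence and function names are hypothetical. The resulting counts could then be ranked and bucketed in the same way as single words.

```python
from collections import Counter


def bigrams(tokens):
    """All pairs of consecutive tokens in a section."""
    return list(zip(tokens, tokens[1:]))


def bigram_counts(section):
    """Frequency of each bigram within one section."""
    return Counter(bigrams(section))


section = ["machine", "learning", "beats", "machine", "learning", "hype"]
counts = bigram_counts(section)
print(counts.most_common(1))
```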
22. Coverage of Top Ranked Tokens
Relevance may also be determined by the number of top ranked
words which occur in the section.
The value of this feature is the coverage of top ranked words per
bucket.
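One possible reading of this feature is sketched below: for each rank bucket, the fraction of that bucket's words that occur at least once in the section. This interpretation, the bucket size and the word list are assumptions; the slides do not define the computation in more detail.

```python
def bucket_coverage(section, ranked_words, bucket_size):
    """Fraction of each bucket's words that appear in the section."""
    present = set(section)
    coverage = []
    for start in range(0, len(ranked_words), bucket_size):
        bucket = ranked_words[start:start + bucket_size]
        hits = sum(1 for word in bucket if word in present)
        coverage.append(hits / len(bucket))
    return coverage


# Hypothetical words, already ordered from most to least frequent.
ranked_words = ["svm", "kernel", "margin", "search", "web"]
section = ["svm", "margin", "svm"]
coverage = bucket_coverage(section, ranked_words, bucket_size=2)
print(coverage)
```

Unlike the word rank features, repeated occurrences do not count here: a word either covers its slot in the bucket or it does not.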
23. Distance from the Query
The intuition here is that the closer a section is to the query in the
Web page, the more likely it is to be relevant.
The value of this feature is the section-wise distance between the
section in question and the nearest section which contains the
query.
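A minimal sketch of the section-wise distance, assuming sections are stored in page order as token lists and distance is measured in number of sections. Returning None when no section contains the query is a choice made for this sketch; the slides do not say how that case is handled.

```python
def distance_from_query(sections, query, index):
    """Number of sections between section `index` and the nearest hit."""
    hits = [i for i, section in enumerate(sections) if query in section]
    if not hits:
        return None  # the query does not occur anywhere on the page
    return min(abs(index - i) for i in hits)


# Hypothetical page, split into tokenised sections in page order.
page = [["intro"], ["svm", "kernel"], ["margin"], ["footer"]]

print(distance_from_query(page, "svm", 3))
print(distance_from_query(page, "svm", 1))
```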
24. Query Word Frequency
The value of this feature is the frequency of the query word in the
section.
The value is normalized by the number of words in the section.
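This feature reduces to a one-line computation. The sketch assumes a single query word and a section given as a list of tokens; returning 0.0 for an empty section is an assumption made to avoid division by zero.

```python
def query_word_frequency(section, query):
    """Occurrences of the query word divided by the section length."""
    if not section:
        return 0.0
    return section.count(query) / len(section)


section = ["svm", "uses", "svm", "kernel"]
print(query_word_frequency(section, "svm"))
```

Normalising by length keeps long sections from scoring higher simply because they contain more words.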
25. Training Set Generation
Query Google to get a set of pages
Clean each page: remove scripts, pictures, links, etc.
Break each page into sections.
Label each section of every page.
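The cleaning and sectioning steps might look like the sketch below, which uses Python's standard html.parser module. The rule for what counts as a section (a new section at each block-level tag in BLOCK) is an assumption; the slides do not say how pages are split. Script and style contents are dropped, images vanish because they carry no text, and link markup is removed while the link text is kept. Fetching the pages and labeling the sections are left out.

```python
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collect visible text, starting a new section at block-level tags."""

    SKIP = {"script", "style"}                    # contents are discarded
    BLOCK = {"p", "div", "h1", "h2", "h3", "li"}  # assumed section breaks

    def __init__(self):
        super().__init__()
        self.sections = []
        self._current = []
        self._skip_depth = 0

    def flush(self):
        """Close the current section if it contains any text."""
        text = " ".join(" ".join(self._current).split())
        if text:
            self.sections.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._current.append(data)


def page_to_sections(html):
    """Return the cleaned text of each section of one page."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.sections


html = (
    "<html><body><script>var x = 1;</script><h1>SVM</h1>"
    '<p>Support <a href="x">vector</a> machines.</p>'
    '<img src="a.png"><p>Kernels.</p></body></html>'
)
sections = page_to_sections(html)
print(sections)
```

Each resulting section would then be labeled by hand as relevant or not relevant to the query to form the training set.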
26. Learning Algorithm
A Support Vector Machine with a linear kernel is used.
Given the relatively high dimensionality of the feature vector, it is a
reasonable choice to use an SVM.
The predicted margin of each sample is used as a non-binary
metric of how relevant each section is.
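A minimal sketch of turning margins into a relevance ranking, assuming a linear SVM has already been trained. The weights, offset, section names and feature values are hypothetical. Instead of keeping only the sign of w . x + b, the value itself is used as a score.

```python
def decision_value(w, b, features):
    """Signed score of a sample relative to the separating hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, features)) + b


# Hypothetical trained linear SVM.
w = [1.5, -0.5, 2.0]
b = -1.0

# Hypothetical feature vectors, one per section of a page.
sections = {
    "intro": [0.2, 0.9, 0.0],
    "svm-overview": [0.8, 0.1, 0.5],
    "footer": [0.0, 1.0, 0.0],
}

scores = {name: decision_value(w, b, x) for name, x in sections.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A larger positive value means the section lies further on the relevant side of the hyperplane, so sorting by this value orders sections from most to least relevant.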
27. Conclusion
Support Vector Machines are an attractive approach to data
modelling.
Evaluations suggest that using information-retrieval-inspired
features and some basic hints from summarization gives respectable
accuracy in detecting the most relevant section in a page.
Thus SQUINT can have a large impact on the user’s overall search
experience.
28. References
Nello Cristianini and John Shawe-Taylor. An Introduction to
Support Vector Machines and Other Kernel-based Learning Methods.
Cambridge University Press, 2000.
Siddharth Jonathan J.B., Riku Inoue and Jyotika Prasad. SQUINT
SVM for Identification of Relevant Sections in Web Pages for Web
Search.
Wikipedia article on Support Vector Machines,
http://en.wikipedia.org/wiki/Support_vector_machine
Machine Learning Course on Coursera,
https://class.coursera.org/ml-2012-002/class/index