Identification of Relevant Sections in Web Pages Using a
               Machine Learning Approach




                                  Jerrin Shaji George

                                      NIT Calicut


                                  November 8, 2012
Introduction

  There is a massive amount of data available on the internet.
  Extracting only the relevant content has become very important.
  A Machine Learning approach is suitable as it can adapt to the
  rapidly changing dynamics of the internet.




Machine Learning

  The science of getting computers to act without being explicitly
  programmed.
  A method of teaching computers to make and improve predictions
  or behaviors based on some data.
  Machine Learning Algorithms:
          Supervised Machine Learning
          Unsupervised Machine Learning




Supervised Learning

  Machine learning task of inferring a function from labeled training
  data.




           Figure: Supervised Learning Model (courtesy scikit-learn)
Supervised Learning

  Example of a classification problem - discrete valued output.




                   Figure: Copyright © Victor Lavrenko

Supervised Learning

  Example of a regression problem - continuous valued output.




                   Figure: Copyright © Victor Lavrenko

Unsupervised Learning

  The data has no labels. The algorithm tries to find similarities
  between the objects in question.




          Figure: Unsupervised Learning Model (courtesy scikit-learn)
Unsupervised Learning

  Example of a clustering problem




                   Figure: Copyright © Victor Lavrenko
Support Vector Machines (SVM)

  A supervised learning model.
  Used for classification and regression analysis.
  The basic SVM:
          A non-probabilistic binary linear classifier.
          Assigns each input to one of two possible classes; the assigned
          class is the output.
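As a minimal sketch of the decision rule behind a binary linear classifier, the snippet below labels an input by the side of a hyperplane it falls on. The weight vector `w`, the offset `b`, and the function name `classify` are hand-picked for illustration and do not come from the slides; a trained SVM would learn `w` and `b` from data.

```python
def classify(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical, hand-picked parameters; a trained SVM would learn these from data.
w = [1.0, -1.0]
b = 0.5

# Two inputs on opposite sides of the hyperplane.
labels = [classify(w, b, x) for x in [(2.0, 0.0), (0.0, 2.0)]]
print(labels)
```

The output is a class label only, with no probability attached, which is what "non-probabilistic" refers to.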




The SVM Algorithm

   Inputs are formulated as feature vectors.
   The feature vectors are mapped into a feature space by using a
   kernel function.
   A division is computed in the feature space to optimally separate
   the classes of training vectors.




The SVM Algorithm

               φ: the feature map induced by the kernel function




Formal Definition of SVM

   An SVM constructs a hyperplane or set of hyperplanes in a high-
   or infinite-dimensional space.
   It can be used for classification and regression.
   A good separation is achieved by the hyperplane that has the
   largest distance to the nearest training data point of any class
   (called the functional margin).
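The largest-distance criterion above can be written down using the standard textbook hard-margin formulation. This is a sketch, not taken from the slides: the symbols w (normal vector), b (offset), x_i (training vectors) and y_i in {-1, +1} (labels) are assumed notation.

```latex
\[
  d(x_i) = \frac{\lvert w \cdot x_i + b \rvert}{\lVert w \rVert},
  \qquad
  \min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^{2}
  \quad \text{subject to} \quad
  y_i \,(w \cdot x_i + b) \ge 1 \quad \text{for all } i .
\]
```

Under these constraints the nearest training points lie at distance 1/||w|| from the hyperplane, so minimizing ||w|| is the same as maximizing the distance to the nearest point.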




Optimal Separating Hyperplane




                 Figure: Courtesy Steve Gunn

Functional Margin

   The vectors (points) that constrain the width of the margin are the
   support vectors.




                       Figure: Image from scikit-learn
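To illustrate which points constrain the margin, the sketch below measures each point's distance to a given separating hyperplane and returns the closest ones. The toy points, the hyperplane (the vertical line x1 = 2), and the helper names are assumptions made for this example.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def support_vectors(w, b, points, tol=1e-9):
    """Return the points closest to the hyperplane, plus that minimum distance."""
    dists = [distance_to_hyperplane(w, b, p) for p in points]
    margin = min(dists)
    return [p for p, d in zip(points, dists) if d - margin <= tol], margin

# Hypothetical toy data separated by the vertical line x1 = 2 (w = (1, 0), b = -2).
points = [(0.0, 1.0), (1.0, 0.0), (3.0, 0.0), (5.0, 2.0)]
vectors, margin = support_vectors([1.0, 0.0], -2.0, points)
print(vectors, margin)
```

Moving any of the other points (while keeping them farther away than the margin) would leave the hyperplane unchanged; only the returned points pin it down.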
Mapping to Higher Dimensions

   Sometimes data is not linearly separable.
   If the original finite-dimensional space is mapped into a much
   higher-dimensional space, the separation is made easier in that
   space.
   This is achieved by the SVM using the Kernel Trick.
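A small numeric check can make the kernel trick concrete. The sketch below uses the degree-2 polynomial kernel as an example (chosen for illustration; the slides do not name a specific kernel) and shows that it equals an inner product in a 3D feature space without ever constructing that space. The names `phi`, `dot`, and `poly_kernel` are hypothetical.

```python
import math

def phi(x):
    """Explicit feature map from 2D to 3D for the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed entirely in the original 2D space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(x), phi(z))   # inner product in the 3D feature space
implicit = poly_kernel(x, z)     # same value without ever building phi
print(explicit, implicit)
```

Because the SVM only needs inner products between vectors, it can work with the kernel value directly and skip the explicit mapping.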




Mapping to Higher Dimensions

   Mapping from 1D to 2D




   Mapping from 2D to 3D




                     Figure: Courtesy Steve Gunn
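The 1D to 2D case can be sketched in a few lines. The data, the mapping x -> (x, x^2), and the separating line x2 = 1 are all assumptions chosen for illustration; they are not the example shown in the figure.

```python
def lift(x):
    """Map a 1D point to 2D by adding its square as a second coordinate."""
    return (x, x * x)

# Hypothetical 1D data: class +1 lies on both sides of class -1,
# so no single threshold on x separates them.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]

def classify(point):
    """Linear rule in the lifted space: the horizontal line x2 = 1."""
    _, x2 = point
    return 1 if x2 - 1.0 > 0 else -1

predicted = [classify(lift(x)) for x in xs]
print(predicted == ys)
```

After the lift, a straight line separates the two classes even though no threshold did in the original space.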
Identification of Relevant Sections in a Web Page for
Web Search

   Shallow techniques like keyword matching give unsatisfactory
   results.
   Search methodologies must focus more on contextual information
   than just keyword occurrences.
           The search term might not be a very differentiating term.
           It might not appear in the section at all.

   SQUINT: an SVM-based approach to identify the sections of a Web
   page that are relevant to a Web search.



Overall Architecture




Feature Generation

   Word Rank Based Features
   Bigram Rank Based Features
   Coverage of Top Ranked Tokens
   Query Word Frequency
   Distance from the Query




Word Rank Based Features

   The rank of a word is its position in the list of all words
   ordered by frequency of occurrence across all search results.
   The value of this feature is the frequency of the particular word in
   the given section.
   Bucketing can be used to reduce dimensionality.
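One way to read the three points above is sketched below: rank words over all search results, then sum each section's word frequencies per bucket of consecutive ranks. The tie-breaking rule, the bucket layout, and the function names are assumptions; the slides do not specify them.

```python
from collections import Counter

def word_ranks(all_results_tokens):
    """Rank words by frequency across all search results (rank 0 = most frequent)."""
    counts = Counter(all_results_tokens)
    # Ties are broken alphabetically so the ranking is deterministic (an assumption).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered)}

def rank_features(section_tokens, ranks, bucket_size, num_buckets):
    """Sum word frequencies in the section per bucket of consecutive ranks."""
    features = [0] * num_buckets
    for word, freq in Counter(section_tokens).items():
        rank = ranks.get(word)
        if rank is not None and rank // bucket_size < num_buckets:
            features[rank // bucket_size] += freq
    return features

corpus = "svm kernel svm margin svm kernel data web".split()
ranks = word_ranks(corpus)
features = rank_features("svm svm web kernel".split(), ranks, bucket_size=2, num_buckets=3)
print(ranks, features)
```

With a bucket size of 2, five distinct ranks collapse into three feature dimensions, which is the dimensionality reduction bucketing is meant to provide.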




Bigram Rank Based Features

   A bigram is defined to be two consecutive words occurring in a
   section.
   E.g., the bigram "machine learning" may be more important than
   "machine" and "learning" separately.
   The value of the feature is calculated in the same way as for
   Word Rank Based Features.
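Extracting bigrams from a tokenized section is short enough to sketch directly. The whitespace tokenization and the example sentence are assumptions for illustration.

```python
from collections import Counter

def bigrams(tokens):
    """Return consecutive word pairs from a token list."""
    return list(zip(tokens, tokens[1:]))

section = "machine learning beats plain machine learning rules".split()
counts = Counter(bigrams(section))
print(counts.most_common(1))
```

The resulting counts can then be ranked and bucketed exactly like single-word frequencies.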




Coverage of Top Ranked Tokens

   Relevance may also be determined by the number of top ranked
   words which occur in the section.
   The value of this feature is the coverage of top ranked words per
   bucket.
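A possible reading of "coverage per bucket" is the fraction of each bucket's top-ranked words that appear in the section at least once. That interpretation, along with the function name and toy data below, is an assumption; the slides do not give a formula.

```python
def coverage_per_bucket(section_tokens, ranked_words, bucket_size, num_buckets):
    """For each rank bucket, the fraction of its words that occur in the section."""
    present = set(section_tokens)
    coverage = []
    for i in range(num_buckets):
        bucket = ranked_words[i * bucket_size:(i + 1) * bucket_size]
        hits = sum(1 for word in bucket if word in present)
        coverage.append(hits / len(bucket) if bucket else 0.0)
    return coverage

# Hypothetical ranking, most frequent word first.
ranked_words = ["svm", "kernel", "data", "margin"]
coverage = coverage_per_bucket("svm kernel web data".split(), ranked_words, 2, 2)
print(coverage)
```

Unlike the frequency features, this value ignores how often a word repeats and only records whether it is present.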




Distance from the Query

   The intuition here is that the closer a section is to the query in the
   Web page, the more likely it is to be relevant.
   The value of this feature is the section-wise distance between the
   section in question and the nearest section which contains the
   query.
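The section-wise distance can be sketched as an index difference over the ordered list of sections. Case-insensitive substring matching and the `None` value for pages that never mention the query are assumptions made here, not details from the slides.

```python
def distance_from_query(sections, query):
    """For each section, the number of sections to the nearest one containing the query."""
    query = query.lower()
    hits = [i for i, text in enumerate(sections) if query in text.lower()]
    if not hits:
        return [None] * len(sections)  # no section mentions the query
    return [min(abs(i - h) for h in hits) for i in range(len(sections))]

sections = ["Intro text", "About kernels", "SVM basics", "More details", "Contact"]
distances = distance_from_query(sections, "svm")
print(distances)
```

A section that contains the query gets distance 0, and its neighbours get 1, 2, and so on.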




Query Word Frequency

   The value of this feature is the frequency of the query word in the
   section.
   The value is normalized by the number of words in the section.
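As a minimal sketch, the normalized frequency is the count of the query word divided by the section length in words. Case-insensitive matching on whitespace tokens is an assumption.

```python
def query_word_frequency(section_tokens, query_word):
    """Occurrences of the query word divided by the section length in words."""
    if not section_tokens:
        return 0.0
    query_word = query_word.lower()
    hits = sum(1 for token in section_tokens if token.lower() == query_word)
    return hits / len(section_tokens)

value = query_word_frequency("SVM training with an svm kernel".split(), "svm")
print(value)
```

Normalizing keeps long sections from scoring higher merely because they contain more words.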




Training Set Generation

   Query Google to get a set of pages.
   Clean each page: remove scripts, pictures, links, etc.
   Break each page into sections.
   Label each section of every page.
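The cleaning and sectioning steps could look roughly like the sketch below, which uses Python's standard `html.parser`. The slides do not say how sections are delimited, so treating block-level tags as section boundaries is an assumption, as are the class and function names. Link markup is dropped but the anchor text is kept, and labeling remains a manual step.

```python
from html.parser import HTMLParser

class SectionExtractor(HTMLParser):
    """Collect visible text, starting a new section at each block-level tag."""
    SKIP = {"script", "style"}                      # content dropped entirely
    BLOCK = {"p", "div", "h1", "h2", "h3", "li"}    # assumed section boundaries

    def __init__(self):
        super().__init__()
        self.sections = []
        self._current = []
        self._skip_depth = 0

    def _flush(self):
        """Close the current section, normalizing whitespace."""
        text = " ".join(" ".join(self._current).split())
        if text:
            self.sections.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._current.append(data)

def split_into_sections(html):
    """Return the cleaned text of each section found in the HTML string."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.sections

page = ("<h1>SVM</h1><script>var x = 1;</script>"
        "<p>A <a href='#'>linear</a> classifier.</p><img src='a.png'>")
sections = split_into_sections(page)
print(sections)
```

Each returned string would then be labeled as relevant or not relevant to build the training set.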




Learning Algorithm

   A Support Vector Machine with a linear kernel is used.
   Given the relatively high dimensionality of the feature vector, it is a
   reasonable choice to use an SVM.
   The predicted margin of each sample is used as a non-binary
   metric of how relevant each section is.
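The idea of using predicted margins as a relevance score can be sketched with scikit-learn's `SVC` and its `decision_function` method. This is not SQUINT's actual implementation, which the slides do not show; the two-dimensional feature vectors and labels below are made up for illustration.

```python
from sklearn.svm import SVC

# Hypothetical two-dimensional feature vectors; label 1 = relevant, 0 = not relevant.
X_train = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y_train = [0, 0, 0, 1, 1, 1]

model = SVC(kernel="linear", C=1.0)
model.fit(X_train, y_train)

# decision_function returns a signed score per sample: positive values fall on
# the side of class 1, and larger magnitudes lie farther from the hyperplane.
X_new = [[4, 4], [2.5, 2.5], [0.5, 0.5]]
scores = model.decision_function(X_new)

# Rank the new sections from most to least relevant.
ranking = sorted(range(len(X_new)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Sorting by score rather than by predicted label lets the most relevant section in a page be picked even when several sections land on the same side of the hyperplane.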




Conclusion

   Support Vector Machines are an attractive approach to data
   modelling.
   Evaluations suggest that using information-retrieval-inspired
   features and some basic hints from summarization gives respectable
   accuracy in detecting the most relevant section in a page.
   Thus SQUINT can have a large impact on the user’s overall search
   experience.




References

   Cristianini, Nello and Shawe-Taylor, John. An Introduction to
   Support Vector Machines and Other Kernel-based Learning Methods.
   Cambridge University Press, 2000.
   Siddharth Jonathan J.B., Riku Inoue and Jyotika Prasad. SQUINT
   SVM for Identification of Relevant Sections in Web Pages for Web
   Search.
   Wikipedia article on Machine Learning,
   http://en.wikipedia.org/wiki/Support_vector_machine
   Machine Learning Course on Coursera,
   https://class.coursera.org/ml-2012-002/class/index



Identification of Relevant Sections in Web Pages Using a Machine Learning Approach

  • 1. Identification of Relevant Sections in Web Pages Using a Machine Learning Approach Jerrin Shaji George NIT Calicut November 8, 2012
  • 2. Introduction There is a massive amount of data available on the internet. Extracting only the relevant content has become very important. A Machine Learning approach is suitable as it can adapt to the rapidly changing dynamics of the internet. 2 of 28
  • 3. Machine Learning The science of getting computers to act without being explicitly programmed. A method of teaching computers to make and improve predictions or behaviors based on some data. Machine Learning Algorithms : Supervised Machine Learning Unsupervised Machine Learning 3 of 28
  • 4. Supervised Learning Machine learning task of inferring a function from labeled training data. Figure: Supervised Learning Model (courtesy scikit-learn) 4 of 28
  • 5. Supervised Learning: Example of a classification problem (discrete-valued output). Figure: Copyright © Victor Lavrenko.
  • 6. Supervised Learning: Example of a regression problem (continuous-valued output). Figure: Copyright © Victor Lavrenko.
  • 7. Unsupervised Learning: The data has no labels. The algorithm tries to find similarities between the objects in question. Figure: Unsupervised Learning Model (courtesy scikit-learn).
  • 8. Unsupervised Learning: Example of a clustering problem. Figure: Copyright © Victor Lavrenko.
  • 9. Support Vector Machines (SVM): A supervised learning model used for classification and regression analysis. The basic SVM is a non-probabilistic binary linear classifier: it assigns each given input to one of two possible classes, and that class forms the output.
  • 10. The SVM Algorithm: Inputs are formulated as feature vectors. The feature vectors are mapped into a feature space by using a kernel function. A division is computed in the feature space to optimally separate the classes of training vectors.
  • 11. The SVM Algorithm: φ denotes the kernel function.
  • 12. Formal Definition of SVM: An SVM constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space. It can be used for classification and regression. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (called the functional margin).
  • 13. Optimal Separating Hyperplane: Figure courtesy Steve Gunn.
  • 14. Functional Margin: The vectors (points) that constrain the width of the margin are the support vectors. Figure: Image from scikit-learn.
  • 15. Mapping to Higher Dimensions: Sometimes data is not linearly separable. If the original finite-dimensional space is mapped into a much higher-dimensional space, the separation is made easier in that space. The SVM achieves this using the kernel trick.
  • 16. Mapping to Higher Dimensions: Mapping from 1D to 2D. Mapping from 2D to 3D. Figure: Courtesy Steve Gunn.
  • 17. Identification of Relevant Sections in a Web Page for Web Search: Shallow techniques like keyword matching give unsatisfactory results. Search methodologies must focus more on contextual information than just keyword occurrences. The search term might not be a very differentiating term, and it might not appear in the section at all. SQUINT: an SVM-based approach to identify sections of a Web page relevant to a Web search.
  • 19. Feature Generation: Word Rank Based Features; Bigram Rank Based Features; Coverage of Top Ranked Tokens; Query Word Frequency; Distance from the Query.
  • 20. Word Rank Based Features: The rank of a word is its position in the list of words ordered by frequency of occurrence across all search results. The value of this feature is the frequency of the particular word in the given section. Bucketing can be used to reduce dimensionality.
  • 21. Bigram Rank Based Features: A bigram is defined to be two consecutive words occurring in a section. E.g., "machine learning" may be more important than "machine" and "learning" separately. The value of the feature is calculated in the same way as for Word Rank Based Features.
  • 22. Coverage of Top Ranked Tokens: Relevance may also be determined by the number of top ranked words which occur in the section. The value of this feature is the coverage of top ranked words per bucket.
  • 23. Distance from the Query: The intuition here is that the closer a section is to the query in the Web page, the more likely it is to be relevant. The value of this feature is the section-wise distance between the section in question and the nearest section which contains the query.
  • 24. Query Word Frequency: The value of this feature is the frequency of the query word in the section. The value is normalized by the number of words in the section.
  • 25. Training Set Generation: Query Google to get a set of pages. Clean each page: remove scripts, pictures, links, etc. Break each page into sections. Label each section of every page.
  • 26. Learning Algorithm: A Support Vector Machine with a linear kernel is used. Given the relatively high dimensionality of the feature vector, an SVM is a reasonable choice. The predicted margin of each sample is used to get a non-binary metric of how relevant each section is.
  • 27. Conclusion: Support Vector Machines are an attractive approach to data modelling. Evaluations suggest that using information-retrieval-inspired features and some basic hints from summarization gives respectable accuracy with respect to detecting the most relevant section in a page. Thus SQUINT can have a large impact on the user’s overall search experience.
  • 28. References: Cristianini, Nello and Shawe-Taylor, John. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000. Siddharth Jonathan J.B., Riku Inoue and Jyotika Prasad. SQUINT: SVM for Identification of Relevant Sections in Web Pages for Web Search. Wikipedia article on Support Vector Machines, http://en.wikipedia.org/wiki/Support_vector_machine. Machine Learning Course on Coursera, https://class.coursera.org/ml-2012-002/class/index.
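
The feature definitions on slides 20 to 24 and the margin-based scoring on slide 26 can be sketched in Python as follows. This is only an illustrative sketch, not the SQUINT implementation: the function names, the tokenizer, the example sections, and the weights are all hypothetical. Word ranks are computed over the sections of one toy page rather than over all search results, and a section count is returned as the distance when no section contains the query; both are assumptions made for brevity.

```python
from collections import Counter

PUNCT = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; the slides name none)."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def word_ranks(sections):
    """Rank 1 = most frequent word; ties are broken alphabetically."""
    counts = Counter(tok for s in sections for tok in tokenize(s))
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i + 1 for i, w in enumerate(ordered)}

def query_word_frequency(section, query):
    """Query-word occurrences, normalized by the section length."""
    tokens = tokenize(section)
    if not tokens:
        return 0.0
    query_words = set(tokenize(query))
    return sum(1 for t in tokens if t in query_words) / len(tokens)

def distance_from_query(sections, index, query):
    """Section-wise distance to the nearest section containing the query."""
    query_words = set(tokenize(query))
    hits = [i for i, s in enumerate(sections) if query_words & set(tokenize(s))]
    if not hits:
        return len(sections)  # assumed fallback when the query never occurs
    return min(abs(index - i) for i in hits)

def top_token_coverage(section, ranks, top_k):
    """Fraction of the top_k ranked words that occur in the section."""
    top = {w for w, r in ranks.items() if r <= top_k}
    if not top:
        return 0.0
    return len(top & set(tokenize(section))) / len(top)

def linear_margin(weights, bias, features):
    """Signed score w.x + b of a linear classifier; larger = more relevant."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# Hypothetical toy page split into sections, and a one-word query.
sections = [
    "Support vector machines are supervised learning models.",
    "A kernel maps inputs into a feature space.",
    "Contact us for advertising and careers.",
]
query = "kernel"
ranks = word_ranks(sections)

def feature_vector(index):
    section = sections[index]
    return [
        query_word_frequency(section, query),
        distance_from_query(sections, index, query),
        top_token_coverage(section, ranks, top_k=1),
    ]

# Hand-picked weights for illustration only; a trained SVM would learn them.
weights, bias = [2.0, -0.5, 1.0], -0.5
scores = [linear_margin(weights, bias, feature_vector(i)) for i in range(len(sections))]
best_section = max(range(len(sections)), key=lambda i: scores[i])
```

In a real pipeline the weights and bias would come from training a linear-kernel SVM on the labeled sections described on slide 25, and the signed margin would then rank the sections of an unseen page instead of giving only a yes/no label.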