Smart crawler
Classifying the Web
Luiz Henrique Zambom Santana
lhzsantana@gmail.com
Prof. Dr. Mauro Roisenberg
INE – PPGCC – Connectionist AI
2015
Agenda
• Objective
• Architecture
• Implementation
• Naive Bayes
• SVM
• Conclusions
Objectives
• Classify web page contents
• Idea:
• If:
• www.infomoney.com.br = Finance
• www.lance.com.br = Football
• www.4rodas.com.br = Cars
• So:
• www.valor.com.br = Finance
• placar.abril.com.br = Football
• revistaautoesporte.globo.com = Cars
Motivation
• If we know the category of a page, then
• We can better parse
• We can provide better search results
• We can customize the user experience
Architecture
[Diagram: pipeline stages and the components shown for each]
• Crawling: Crawler4J + Lucene, Elasticsearch
• Training: Elasticsearch, MLlib, Model
• Classifying: Web, Crawler4J + Lucene, MLlib, Model
Apache Spark
• MLlib
• Cluster + in-memory computation
Multinomial Naive Bayes
• Probabilistic model for classification
• From sample documents we can infer how the documents of each class are “generated”
• Naive: assumes independence between every pair of features
• Bag-of-words model
• Compares histograms of term probabilities
BAD RESULTS: Naive Bayes works well only when the independence assumption is satisfied by the variables of the dataset and the degree of class overlap is small
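The bullets above can be illustrated with a minimal sketch of multinomial Naive Bayes over bag-of-words counts. This is plain Java written for illustration, not the Spark MLlib implementation used in the project; the class name `TinyNaiveBayes`, its methods, and the toy tokens are all hypothetical. The `lambda` parameter plays the role of additive smoothing.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative multinomial Naive Bayes (hypothetical class, not the MLlib API). */
class TinyNaiveBayes {
    private final double lambda; // additive smoothing
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> totalTerms = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    TinyNaiveBayes(double lambda) {
        this.lambda = lambda;
    }

    /** Adds one labelled document, given as a bag of tokens. */
    void train(String label, String[] tokens) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            totalTerms.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    /** Returns the class with the highest log prior plus summed log term likelihoods. */
    String predict(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            double denom = totalTerms.getOrDefault(label, 0) + lambda * vocabulary.size();
            for (String t : tokens) {
                // "Naive" step: each term contributes independently of the others
                score += Math.log((counts.getOrDefault(t, 0) + lambda) / denom);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes(1.0);
        nb.train("finance", new String[] {"stock", "market", "bank"});
        nb.train("cars", new String[] {"engine", "wheel", "car"});
        System.out.println(nb.predict(new String[] {"bank", "stock"}));
    }
}
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.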
Support vector machine (SVM)
• Non-probabilistic binary linear classifier
• The number of iterations is configurable
• Slower!
• Pairwise (“one vs. one”) approach with a committee [1, 2]
• Six models (four classes), then 21 models (seven classes)
• The class with the most votes is the winner
[1] e Silva, Sergio Roberto de Lima, and Mauro Roisenberg. "Continuous authentication by keystroke dynamics using committee machines." Intelligence and Security Informatics. Springer Berlin Heidelberg, 2006. 686-687.
[2] Sun, Bing-Yu, et al. "Support vector machine committee for classification." Advances in Neural Networks–ISNN 2004. Springer Berlin Heidelberg, 2004. 648-653.
Finance vs. Sport, Finance vs. Movies, Finance vs. Cars
Sport vs. Movies, Sport vs. Cars
Movies vs. Cars
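The pair list above corresponds to one binary model per pair of classes, so k classes need k(k-1)/2 models: 6 for four classes and 21 for seven. The following is a minimal sketch of that committee vote in plain Java. The class `PairwiseCommittee` and the stand-in `binaryModel` function are hypothetical; in the project each pair would be backed by a trained SVM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Illustrative pairwise committee (hypothetical class, not the project's code). */
class PairwiseCommittee {

    /** All unordered class pairs: k classes give k * (k - 1) / 2 binary models. */
    static List<String[]> pairs(List<String> classes) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < classes.size(); i++) {
            for (int j = i + 1; j < classes.size(); j++) {
                result.add(new String[] {classes.get(i), classes.get(j)});
            }
        }
        return result;
    }

    /** Each pairwise model votes for one of its two classes; the most-voted class wins. */
    static String vote(List<String> classes, BiFunction<String, String, String> binaryModel) {
        Map<String, Integer> votes = new HashMap<>();
        for (String[] p : pairs(classes)) {
            votes.merge(binaryModel.apply(p[0], p[1]), 1, Integer::sum);
        }
        String winner = null;
        int best = -1;
        for (String c : classes) { // class order breaks ties deterministically
            int v = votes.getOrDefault(c, 0);
            if (v > best) {
                best = v;
                winner = c;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        List<String> four = Arrays.asList("Finance", "Sport", "Movies", "Cars");
        System.out.println(pairs(four).size() + " pairwise models");
        // Stand-in model: always prefers "Sport" when it is one of the two candidates
        System.out.println(vote(four, (a, b) -> a.equals("Sport") || b.equals("Sport") ? "Sport" : a));
    }
}
```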
Tools
• Crawler4j: retrieving the Web pages
• Jsoup: parsing
• Elasticsearch: indexing
• Lucene: removing stop words
• Spark MLlib: AI
• Multinomial Naive Bayes
• SVM
Implementation details - Training
1. Set of pages is used as input to the models
String[] pagesCars = {"http://g1.globo.com/carros/index.html", "http://quatrorodas.abril.com.br/"};
String[] pagesFinance = {"http://www.valor.com.br/", "http://www.infomoney.com.br/", "http://exame.abril.com.br/"};
String[] pagesSport = {"http://globoesporte.globo.com/", "http://oledobrasil.com.br/", "http://espn.uol.com.br"};
String[] pagesMovies = {"http://www.imdb.com/list/ls002231878/", "http://www.adorocinema.com/", "http://www.filmeb.com.br/",
    "http://www.revistabula.com/3165-lista-dos-100-melhores-filmes-de-todos-os-tempos-segundo-hollywood/"};
Implementation details - Training
2. Clean the page and calculate the feature vector using HashingTF
• Get only the page text (i.e., exclude HTML tags)
• Use Lucene to remove stop words, symbols, numbers and other meaningless parts
• Compute the term frequency and create a feature vector
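The cleaning and feature-vector step above can be sketched without any library. The example below is an assumption-laden stand-in: the stop-word list is a tiny hypothetical one (the project uses Lucene for this), and `termFrequencies` only imitates the idea behind HashingTF, mapping each term to a bucket by its hash and counting occurrences, rather than calling Spark.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative cleaning + hashed term-frequency vector (hypothetical class). */
class HashingTermFrequency {
    // Hypothetical, tiny stop-word list; a real pipeline would use Lucene's analyzers
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of", "and", "to"));

    /** Lower-cases, drops symbols and numbers, then removes stop words. */
    static String[] tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim();
        if (cleaned.isEmpty()) {
            return new String[0];
        }
        return Arrays.stream(cleaned.split(" "))
                .filter(t -> !STOP_WORDS.contains(t))
                .toArray(String[]::new);
    }

    /** Hashing trick: each term increments the bucket given by its hash modulo numFeatures. */
    static double[] termFrequencies(String[] tokens, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String t : tokens) {
            int index = Math.floorMod(t.hashCode(), numFeatures);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("The price of 2 cars, and the car!");
        System.out.println(Arrays.toString(tokens));
        System.out.println(Arrays.toString(termFrequencies(tokens, 16)));
    }
}
```

Hashing keeps the vector size fixed regardless of vocabulary, at the cost of occasional collisions between different terms.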
Implementation details - Training
3. Train the model with SVMWithSGD (stochastic gradient descent)
4. Test the data against the models
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.filmeb.com.br/ the model predicts movies
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.valor.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.infomoney.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://exame.abril.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://oledobrasil.com.br/ the model predicts sport
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://espn.uol.com.br the model predicts sport
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://sportv.globo.com/site/ the model predicts movies
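To make the training step concrete, here is a rough sketch of what training a linear SVM with stochastic gradient descent involves: hinge loss plus an L2 penalty, updated one randomly chosen example at a time. It is not the MLlib SVMWithSGD code. The class `LinearSvmSgd`, the label convention {-1, +1}, the decaying step size, and the toy data and parameter values are all assumptions made for illustration.

```java
import java.util.Random;

/** Illustrative linear SVM trained by SGD on hinge loss (hypothetical class). */
class LinearSvmSgd {

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int j = 0; j < a.length; j++) {
            s += a[j] * b[j];
        }
        return s;
    }

    /** Labels y must be -1 or +1 (an assumption of this sketch). No bias term. */
    static double[] train(double[][] x, int[] y, int iterations,
                          double stepSize, double regParam, long seed) {
        double[] w = new double[x[0].length];
        Random random = new Random(seed);
        for (int it = 1; it <= iterations; it++) {
            double eta = stepSize / Math.sqrt(it);   // decaying step size
            int i = random.nextInt(x.length);        // stochastic: one example per step
            boolean violatesMargin = y[i] * dot(w, x[i]) < 1.0;
            for (int j = 0; j < w.length; j++) {
                double grad = regParam * w[j];       // gradient of the L2 penalty
                if (violatesMargin) {
                    grad -= y[i] * x[i][j];          // subgradient of the hinge loss
                }
                w[j] -= eta * grad;
            }
        }
        return w;
    }

    static int predict(double[] w, double[] features) {
        return dot(w, features) >= 0.0 ? 1 : -1;
    }

    public static void main(String[] args) {
        // Toy term-count vectors: feature 0 ~ a finance term, feature 1 ~ a cars term
        double[][] x = {{1, 0}, {2, 0}, {0, 1}, {0, 2}};
        int[] y = {1, 1, -1, -1};
        double[] w = train(x, y, 100, 1.0, 0.01, 42L);
        System.out.println(predict(w, new double[] {3, 0}));
        System.out.println(predict(w, new double[] {0, 3}));
    }
}
```

Each such binary model separates only two classes, which is why a committee of several models is needed for the multi-class case.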
Execution
• Run the committee of 21 models against each crawled page
• Add the category at indexing time
Tests
• First dataset
• Classes: Finance (Infomoney), Sports (Lance), Movies (IMDB), and Cars (4 Rodas)
• Size of training set: 10 documents for each class
• Naive Bayes
• Lambda: 1
• Accuracy: 87.9%
• Second dataset
• Classes: first dataset AND Gossip, Soap opera, Technology
• SVM
• Iterations: 100
• Regularization: 1
• Most of the documents are correctly classified, but having more models did not bring a great gain…
• Classification on the test set is good, but the crawler can find any kind of page, and in that case the result can be anything
Problems
• Templates in portals (headers and footers)
• Documents with too little information (e.g., “assine já”, i.e., “subscribe now”)
• Documents with too much information (e.g., the main page)
Conclusions
• It correctly classified pages in Finance and Sports
• But the results are not perfect
• Remove templates
• Add other models
• Find better training data
• Other possible classification methods
• Use Collaborative Filtering to c
• https://github.com/lhzsantana/smart-crawler
