Smart Crawler

Classifying the Web
Luiz Henrique Zambom Santana
lhzsantana@gmail.com
Prof. Dr. Mauro Roisenberg
INE – PPGCC – Connectionist AI
2015

Agenda
• Objectives
• Architecture
• Implementation
• Naive Bayes
• SVM
• Conclusions

Objectives
• Classify web page contents
• Idea:
• If:
• www.infomoney.com.br = Finance
• www.lance.com.br = Football
• www.4rodas.com.br = Cars
• Then:
• www.valor.com.br = Finance
• placar.abril.com.br = Football
• revistaautoesporte.globo.com = Cars

Motivation
• If we know the category of a page, then:
• We can parse it better
• We can provide better search results
• We can customize the user experience

Architecture
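[Diagram: training and classification pipelines]
• Crawling: Crawler4J + Lucene, indexed in Elasticsearch
• Training: Elasticsearch + MLlib, producing the Model
• Classifying: Web, Crawler4J + Lucene, MLlib with the Model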

Cluster + In Memory computation

Multinomial Naive Bayes
• Probabilistic model for classification
• From document samples we can infer how the documents of each class are “generated”
• Naive: assumption of independence between every pair of features
• Bag-of-words model
• Comparison of term-probability histograms
BAD RESULTS when the assumptions fail: NBC works well only if the independence
assumption is satisfied by the variables of the dataset and the degree of class overlap is small
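The bag-of-words scoring described above can be sketched in plain Java. This is only an illustration, not the project's code (the slides use Spark MLlib); the class name, method names, and toy documents are hypothetical, and `lambda` is the additive smoothing parameter.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; names are hypothetical, not taken from the project.
final class MultinomialNaiveBayesSketch {
    private final double lambda;                                  // additive smoothing
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> totalTerms = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    MultinomialNaiveBayesSketch(double lambda) {
        this.lambda = lambda;
    }

    // Adds one labeled bag of words to the per-class term counts.
    void train(String label, List<String> tokens) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            totalTerms.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    // Returns the class maximizing log P(c) + sum over tokens of log P(t | c).
    String predict(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            double denom = totalTerms.getOrDefault(label, 0) + lambda * vocabulary.size();
            for (String t : tokens) {
                // Independence assumption: each term contributes its own factor.
                score += Math.log((counts.getOrDefault(t, 0) + lambda) / denom);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MultinomialNaiveBayesSketch nb = new MultinomialNaiveBayesSketch(1.0);
        nb.train("finance", Arrays.asList("stock", "market", "bank", "stock"));
        nb.train("sport", Arrays.asList("goal", "match", "team", "goal"));
        System.out.println(nb.predict(Arrays.asList("stock", "bank"))); // prints "finance"
    }
}
```

Working in log space avoids underflow when many small probabilities are multiplied, and the smoothing keeps unseen terms from zeroing out a class.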

Support vector machine (SVM)
• Non-probabilistic binary linear classifier
• The number of iterations can be parameterized
• Slower!
• Pairwise (“One vs. One”) approach with committee [1, 2]
• Six models for four classes, then 21 models for seven classes
• The class with the most votes is the winner
[1] e Silva, Sergio Roberto de Lima, and Mauro Roisenberg. "Continuous authentication by keystroke dynamics using committee machines." Intelligence and Security Informatics. Springer Berlin Heidelberg, 2006. 686-687.
[2] Sun, Bing-Yu, et al. "Support vector machine committee for classification." Advances in Neural Networks–ISNN 2004. Springer Berlin Heidelberg, 2004. 648-653.
The six models for four classes:
• Finance vs. Sport, Finance vs. Movies, Finance vs. Cars
• Sport vs. Movies, Sport vs. Cars
• Movies vs. Cars
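A minimal sketch of the committee vote follows, assuming one binary model per pair of classes. The `BinaryModel` interface and the stub models in `main` are hypothetical stand-ins for trained SVMs; only the pair enumeration and the vote counting are the point here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical, not taken from the project.
final class PairwiseCommitteeSketch {
    /** A binary model that answers with one of the two labels it was trained on. */
    interface BinaryModel {
        String predict(double[] features);
    }

    /** Lists every unordered pair of classes: n * (n - 1) / 2 pairs for n classes. */
    static List<String[]> pairs(List<String> classes) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < classes.size(); i++) {
            for (int j = i + 1; j < classes.size(); j++) {
                result.add(new String[] {classes.get(i), classes.get(j)});
            }
        }
        return result;
    }

    /** Each model casts one vote; the label with the most votes wins (first seen wins ties). */
    static String vote(List<BinaryModel> committee, double[] features) {
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (BinaryModel model : committee) {
            votes.merge(model.predict(features), 1, Integer::sum);
        }
        String winner = null;
        int best = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                winner = e.getKey();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        List<String> four = Arrays.asList("finance", "sport", "movies", "cars");
        List<String> seven = Arrays.asList(
            "finance", "sport", "movies", "cars", "gossip", "soap opera", "technology");
        System.out.println(pairs(four).size());  // prints 6
        System.out.println(pairs(seven).size()); // prints 21

        // Stub models (assumption): each picks the pair member with the larger score.
        List<BinaryModel> committee = new ArrayList<>();
        for (String[] p : pairs(four)) {
            final int a = four.indexOf(p[0]);
            final int b = four.indexOf(p[1]);
            committee.add(x -> x[a] >= x[b] ? p[0] : p[1]);
        }
        System.out.println(vote(committee, new double[] {0.1, 0.9, 0.3, 0.2})); // prints "sport"
    }
}
```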

Tools
• Crawler4j: retrieving the Web pages
• Jsoup: parsing
• Elasticsearch: indexing
• Lucene: removing stop words
• Spark MLlib: machine learning
• Multinomial Naive Bayes
• SVM

Implementation details - Training
1. A set of pages per category is used as input to the models
// Seed pages for each category
String[] pagesCars = {"http://g1.globo.com/carros/index.html", "http://quatrorodas.abril.com.br/"};
String[] pagesFinance = {"http://www.valor.com.br/", "http://www.infomoney.com.br/", "http://exame.abril.com.br/"};
String[] pagesSport = {"http://globoesporte.globo.com/", "http://oledobrasil.com.br/", "http://espn.uol.com.br"};
String[] pagesMovies = {"http://www.imdb.com/list/ls002231878/", "http://www.adorocinema.com/", "http://www.filmeb.com.br/",
    "http://www.revistabula.com/3165-lista-dos-100-melhores-filmes-de-todos-os-tempos-segundo-hollywood/"};
2. Clean the page and calculate the feature vector using HashingTF
• Get only the page text (i.e., exclude HTML tags)
• Use Lucene to remove stop words, symbols, numbers, and other meaningless parts
• Calculate the term frequency and create a feature vector
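The cleaning and hashing steps above can be approximated in plain Java. This sketch does not use Jsoup, Lucene, or Spark's HashingTF: the regex-based tag stripping, the tiny stop-word list, and the use of `String.hashCode()` are all simplifying assumptions, so bucket indices will not match what Spark produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; names are hypothetical, not taken from the project.
final class HashedTermFrequencySketch {
    // Tiny hypothetical stop-word list; the project relies on Lucene for this step.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "o", "de", "the", "of", "and"));

    /** Crudely strips tags, lowercases, and keeps letter-only tokens that are not stop words. */
    static List<String> tokenize(String html) {
        String text = html.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {   // drops symbols and numbers
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Hashing trick: each term increments the bucket hash(term) mod numFeatures. */
    static double[] termFrequencies(List<String> tokens, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String t : tokens) {
            vector[Math.floorMod(t.hashCode(), numFeatures)] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("<p>The price of 3 stocks</p>");
        System.out.println(tokens); // prints [price, stocks]
        System.out.println(Arrays.toString(termFrequencies(tokens, 16)));
    }
}
```

Hashing gives a fixed-length vector without keeping a vocabulary, at the cost of occasional collisions between different terms that land in the same bucket.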

3. Train the model with SVMWithSGD (Stochastic Gradient Descent)
4. Test the data against the models
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.filmeb.com.br/ the model predicts movies
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.valor.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.infomoney.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://exame.abril.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://oledobrasil.com.br/ the model predicts sport
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://espn.uol.com.br the model predicts sport
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://sportv.globo.com/site/ the model predicts movies
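To give an idea of what training a linear SVM by gradient descent involves, here is a plain-Java sketch of subgradient descent on the L2-regularized hinge loss. It is not Spark's SVMWithSGD: for determinism it uses every sample at each step (a stochastic variant would sample a subset), it has no intercept, and the class name, step-size schedule, and toy data are assumptions.

```java
// Illustrative sketch only; names are hypothetical, not taken from the project.
final class LinearSvmSgdSketch {
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Trains weights on the regularized hinge loss; labels must be +1 or -1. */
    static double[] train(double[][] x, int[] y, int iterations, double regParam) {
        int d = x[0].length;
        double[] w = new double[d];
        for (int t = 1; t <= iterations; t++) {
            double[] grad = new double[d];
            for (int i = 0; i < x.length; i++) {
                if (y[i] * dot(w, x[i]) < 1.0) {       // sample violates the margin
                    for (int j = 0; j < d; j++) {
                        grad[j] -= y[i] * x[i][j] / x.length;
                    }
                }
            }
            double step = 1.0 / Math.sqrt(t);           // decaying step size
            for (int j = 0; j < d; j++) {
                // Shrink by the L2 penalty, then move against the hinge subgradient.
                w[j] = w[j] * (1.0 - step * regParam) - step * grad[j];
            }
        }
        return w;
    }

    /** Returns +1 or -1 depending on the side of the separating hyperplane. */
    static int predict(double[] w, double[] features) {
        return dot(w, features) >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        // Toy two-feature data (assumption): +1 when the first feature dominates.
        double[][] x = {{2, 0}, {3, 1}, {0, 2}, {1, 3}};
        int[] y = {1, 1, -1, -1};
        double[] w = train(x, y, 100, 0.01);
        System.out.println(predict(w, new double[] {4, 1})); // prints 1
        System.out.println(predict(w, new double[] {1, 4})); // prints -1
    }
}
```

In a pairwise committee, one such model would be trained per pair of categories, with the two categories mapped to +1 and -1.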

Execution
• Run the committee of 21 models against each crawled page
• Add the category at indexing time

Tests
• First dataset
• Classes: Finance (Infomoney), Sports (Lance), Movies (IMDB), and Cars (4 Rodas)
• Size of training set: 10 documents for each class
• Naive Bayes
• Lambda: 1
• Accuracy: 87.9%
• Second dataset
• Classes: those of the first dataset plus Gossip, Soap opera, and Technology
• SVM
• Iterations: 100
• Regularization: 1
• Most of the documents are correctly classified, but having more models did not bring a
great gain
• Classification on the test set is good, but the crawler can find pages of any kind, and in
that case the result is unpredictable

Problems
• Templates in portals (headers and footers)
• Documents with too little information (e.g., “assine já”, a “subscribe now” page)
• Documents with too much information (e.g., the main page)

Conclusions
• It correctly classified pages in Finance and Sports
• But the results are not perfect
• Remove templates
• Add other models
• Find better training data
• Other possible classification methods
• Use Collaborative Filtering to c
• https://github.com/lhzsantana/smart-crawler
