web page classification
with naïve bayes classifiers

nabeelah ali
27 november 2013
outline
• what is web page classification
• motivation
• literature review
• project design
• experiments
• evaluation
description &
motivation
what is classification?
web page classification
web page classification can
be seen as a type of
document classification
documents vs web pages
• web pages have structure
• HTML indicates headings, paragraphs,
meta-information

• web pages are interconnected
• they contain hyperlinks to other pages
• they have locations (URLs)
why?
web directories
why?
improving search results
why?
• user profile mining
• information filtering
• creation of domain-specific search engines
literature
review
bag of words
text is represented as an unordered
collection of words; word order is discarded
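A minimal sketch of the idea, assuming a simple lowercase alphanumeric tokenizer (the slides do not specify one). The function name `bag_of_words` is hypothetical.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into alphanumeric tokens, and count them."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Word order does not matter: both strings produce the same bag.
bag = bag_of_words("The course page lists the course outline")
same_bag = bag_of_words("outline course the lists page course the")
print(bag == same_bag)
```

Keeping counts rather than a plain set lets a classifier use how often a word occurs, not only whether it occurs.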
n-gram representation
• each document is represented by a vector of
n-gram features (sequences of n adjacent words)

• concepts expressed by phrases can be
captured (e.g. “New York” vs “new” and
“york”)
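To illustrate how a phrase such as “new york” survives as a single feature, here is a small sketch of word n-gram extraction. It assumes whitespace tokenization and the helper name `ngrams` is hypothetical.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "flights to new york".split()
bigrams = ngrams(tokens, 2)
print(bigrams)  # ['flights to', 'to new', 'new york']
```

With n = 1 this reduces to the individual words, so the bag-of-words features are the special case of unigrams.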
using html structure
• assign a weight to each HTML element, and
compute each feature as a linear combination of
its counts in those elements

• e.g. headings would have a greater weight

• four main elements are considered: title,
headings, metadata and main text

Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and
metadata in automated subject classification." Research and Advanced Technology
for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
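The weighting idea can be sketched with Python's standard `html.parser`. The weight values below are placeholders chosen for illustration; neither the slides nor this sketch claim they match the values studied by Golub and Ardö. The class name `WeightedWords` is hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed, illustrative weights for title, headings and metadata.
WEIGHTS = {"title": 4.0, "h1": 3.0, "h2": 3.0, "h3": 3.0, "meta": 2.0}
TEXT_WEIGHT = 1.0  # weight for main text outside the elements above


class WeightedWords(HTMLParser):
    """Accumulate, per word, the sum of (element weight x occurrences)."""

    def __init__(self):
        super().__init__()
        self.open_tags = []        # weighted elements currently open
        self.features = Counter()  # word -> weighted count

    def _add(self, text, weight):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.features[word] += weight

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            # Metadata text lives in the content attribute, not in tag data.
            if attrs.get("name") in ("description", "keywords"):
                self._add(attrs.get("content") or "", WEIGHTS["meta"])
        elif tag in WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self.open_tags.remove(tag)

    def handle_data(self, data):
        # Use the heaviest enclosing element, or the main-text weight.
        weight = max((WEIGHTS[t] for t in self.open_tags), default=TEXT_WEIGHT)
        self._add(data, weight)


page = (
    '<html><head><title>CS 220</title>'
    '<meta name="keywords" content="course"></head>'
    '<body><h1>Course outline</h1>'
    '<p>The course covers data structures.</p></body></html>'
)
parser = WeightedWords()
parser.feed(page)
parser.close()
features = parser.features
print(features["course"])  # 2.0 (meta) + 3.0 (h1) + 1.0 (p) = 6.0
```

Each feature value is a weighted sum of the word's counts in the four element types, which is the linear combination described above.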
visual analysis
• the visual layout rendered by the web browser
is an important signal

• each web page is visualised as an adjacency
multigraph, with each section representing
a different kind of content

Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel
approach for a Web page classification." Proceedings of
SAWM04 workshop, ECML2004. 2004.
URL features
• pages do not need to be fetched or
analysed

• fast!
• tokens are derived from the URL and
used as features

Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification
using URL features." Proceedings of the 14th ACM international
conference on Information and knowledge management. ACM, 2005.
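A simplified sketch of deriving tokens from a URL without fetching the page. It only splits on non-alphanumeric characters; it is not the segmentation method of Kan and Thi, and the URL and the name `url_tokens` are made up for illustration.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split the host, path and query of a URL into lowercase tokens."""
    parts = urlparse(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


# Hypothetical URL; no network access is needed.
tokens = url_tokens("http://www.cs.example.edu/courses/cs100/index.html")
print(tokens)
# ['www', 'cs', 'example', 'edu', 'courses', 'cs100', 'index', 'html']
```

Tokens such as "courses" or "cs100" can then be counted like ordinary words.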
web page classification
project design
dataset
• 4 universities dataset (cornell, texas,
washington, wisconsin)

• each page must be classified into one of seven
categories: course, department, faculty,
project, staff, student, other
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
document classification
single-label classification: one and only one
class label is assigned to each instance
hard classification: an instance either is
or is not in a particular class, with no
intermediate state
multi-class classification: instances are
divided among more than two categories
details of the dataset
experiment #1
bag of words
use the words, unweighted, as features
[figure: word cloud of example features: assistant, CS, Dr, intern, 220, admission, Professor, room, research]
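The deck's title names naive Bayes classifiers, so here is a minimal sketch of a multinomial naive Bayes classifier over unweighted word features. It assumes add-one (Laplace) smoothing, which the slides do not state, and the training pages are toy data invented for illustration. The class name `NaiveBayes` is hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (an assumption)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # class -> word counts
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log P(class) + sum of log P(word | class)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training pages, already tokenized; not taken from the dataset.
docs = [
    ["course", "syllabus", "homework"],
    ["course", "lecture", "exam"],
    ["professor", "research", "publications"],
    ["professor", "office", "research"],
]
labels = ["course", "course", "faculty", "faculty"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["lecture", "homework"]))
print(model.predict(["research", "office"]))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.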
experiment #2

HTML tag weighting

use words weighted by the HTML tags (e.g.
words in <h1> tags will be weighted more
heavily than those in <p> tags)
[figure: word cloud of example features: assistant, CS, Dr, intern, 220, admission, Professor, room, research]
experiment #3
n-gram
use phrases instead of single words as features
[figure: word cloud of example features: assistant, research, contact information, program description, course outline]
evaluation

k-fold cross validation

From http://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/
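A short sketch of how k-fold cross validation partitions the data: every item is used for testing exactly once and for training in the other k - 1 rounds. Shuffling is omitted to keep the example deterministic, and the name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print(len(train), test)
```

The classifier would be trained and scored once per split, and the k scores averaged.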
evaluation
confusion matrix

http://en.wikipedia.org/wiki/Confusion_matrix
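A small sketch of building a confusion matrix for a multi-class problem, with rows as actual classes and columns as predicted classes. The labels and predictions are invented for illustration and are not results from the experiments.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


labels = ["course", "faculty", "student"]
actual = ["course", "course", "faculty", "student", "student", "faculty"]
predicted = ["course", "faculty", "faculty", "student", "course", "faculty"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(accuracy(matrix))
```

Off-diagonal cells show which pairs of classes are confused with each other, which a single accuracy figure hides.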
bibliography
B. Choi and Z. Yao: Web Page Classification, StudFuzz 180, 221–274 (2005)
Qi, Xiaoguang, and Brian D. Davison. "Web page classification: Features and
algorithms." ACM Computing Surveys (CSUR) 41.2 (2009): 12.
Golub, Koraljka, and Anders Ardö. "Importance of HTML
structural elements and metadata in automated subject classification." Research and
Advanced Technology for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification using URL
features." Proceedings of the 14th ACM international conference on Information
and knowledge management. ACM, 2005.
Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel approach for a Web
page classification." Proceedings of SAWM04 workshop, ECML2004. 2004.
questions?
