web page classification
with naïve bayes classifiers

nabeelah ali
27 november 2013

outline
• what is web page classification
• motivation
• literature review
• project design
• experiments
• evaluation

web page classification can
be seen as a type of
document classification

documents vs web pages
• web pages have structure
• HTML indicates headings, paragraphs,
meta-information

• web pages are interconnected
• they contain hyperlinks to other pages
• they have locations (URLs)

why?
• user profile mining
• information filtering
• creation of domain-specific search engines

bag of words
text is represented as an unordered
collection of words: word order is discarded,
word counts are kept
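
A minimal sketch of the bag-of-words idea, assuming a simple lowercase alphanumeric tokenizer (the tokenizer and the sample sentence are illustrative assumptions, not taken from the project):

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for `text`; the order of the words is discarded."""
    # Assumed tokenizer: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

# Hypothetical sample text, for illustration only.
bag = bag_of_words("Research assistant in CS. Research room 220.")
print(bag)
```

Two texts containing the same words in a different order produce the same bag, which is exactly the information this representation gives up.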

n-gram representation
• document is represented by a vector of
features, each a sequence of n consecutive words

• concepts expressed by phrases can be
captured (e.g. “New York” vs “new” and
“york”)
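
A short sketch of word n-gram extraction, assuming the same simple tokenizer as a plain bag of words would use (function name and sample sentence are hypothetical):

```python
import re
from collections import Counter

def ngrams(text, n=2):
    """Count the word n-grams (runs of n consecutive words) in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Slide a window of width n over the token list.
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical sample sentence, for illustration only.
features = ngrams("Flights to New York from York", n=2)
print(features)
```

With n=2 the phrase "new york" becomes a single feature, distinct from the separate words "new" and "york".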

using html structure
• assign a weight to each HTML element, and
make each term's feature value a linear
combination of its weighted counts per element

• e.g. headings would have a greater weight

• four main elements are considered: title,
headings, metadata and main text

Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and
metadata in automated subject classification." Research and Advanced Technology
for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
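
The weighting idea can be sketched roughly as follows, using Python's standard `html.parser`. The weight values, the class name and the choice of `description`/`keywords` meta tags are assumptions made for illustration; they are not the values from Golub and Ardö or from this project.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"title": 4.0, "heading": 3.0, "metadata": 2.0, "text": 1.0}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

class WeightedTermParser(HTMLParser):
    """Accumulates score(term) = sum over elements of weight * count."""

    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self.stack = []  # currently open title/heading tags

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in HEADINGS:
            self.stack.append(tag)
        elif tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") in ("description", "keywords"):
                for tok in tokenize(attrs.get("content") or ""):
                    self.scores[tok] += WEIGHTS["metadata"]

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if not self.stack:
            element = "text"
        elif self.stack[-1] == "title":
            element = "title"
        else:
            element = "heading"
        for tok in tokenize(data):
            self.scores[tok] += WEIGHTS[element]

def weighted_terms(html):
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return parser.scores
```

This sketch treats everything outside a title or heading as main text, so script and style content would also be counted; a fuller version would skip those elements.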

visual analysis
• visual representation by web browser is
important

• each web page is visualised as an adjacency
multigraph, with each section representing
a different kind of content

Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel
approach for a Web page classification." Proceedings of
SAWM04 workshop, ECML2004. 2004.

URL features
• pages do not need to be fetched or
analysed

• fast!
• derives tokens from the URL and uses
these tokens as features

Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification
using URL features." Proceedings of the 14th ACM international
conference on Information and knowledge management. ACM, 2005.
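
A simplified sketch of deriving tokens from a URL without fetching the page. This only splits on non-alphanumeric characters; it is not the full segmentation method of Kan and Thi, and the sample URL is hypothetical.

```python
import re
from urllib.parse import urlsplit

def url_tokens(url):
    """Split the host, path and query of `url` into lowercase tokens."""
    parts = urlsplit(url)
    raw = " ".join([parts.netloc, parts.path, parts.query])
    # Any run of non-alphanumeric characters acts as a separator.
    return [tok for tok in re.split(r"[^a-z0-9]+", raw.lower()) if tok]

# Hypothetical URL, for illustration only.
tokens = url_tokens("http://www.cs.cornell.edu/courses/cs664/index.html")
print(tokens)
```

Tokens such as "courses" can then be used as classifier features in the same way as words from the page text.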

project design

dataset
• WebKB 4 universities dataset (cornell, texas,
washington, wisconsin)

• each page must be classified into a
category: course, department, faculty,
project, staff, student, other
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/

document classification
• single label classification: one and only one
class label is assigned to each instance
• hard classification: an instance either is
or is not in a particular class, with no
intermediate state
• multi-class classification: instances are
divided into more than two categories
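
As a rough illustration of a single-label, hard, multi-class classifier, here is a minimal multinomial naïve Bayes sketch. The add-one (Laplace) smoothing, the class name and the toy training data are assumptions for illustration; the slides do not specify the variant used in the project.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of log likelihoods (words outside the vocabulary are skipped)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocab:
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label  # exactly one label: single-label, hard decision

# Toy training data, invented for illustration.
docs = [["course", "syllabus", "homework"], ["course", "lecture", "exam"],
        ["professor", "research", "publications"], ["professor", "office", "research"]]
labels = ["course", "course", "faculty", "faculty"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["lecture", "homework"]))
```

Taking the highest-scoring class returns one label per page, with no partial membership, and the same code works for any number of categories.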

experiment #1
bag of words
use the words, unweighted, as features
[word cloud: assistant, CS, Dr, intern, 220, admission, Professor, room, research]

experiment #2

HTML tag weighting

use words weighted by the HTML tags (e.g.
words in <h1> tags will be weighted more
heavily than those in <p> tags)
[word cloud: assistant, CS, Dr, intern, 220, admission, Professor, room, research]

experiment #3
n-gram
use phrases instead of single words as features
[word cloud: assistant, research, contact information, program description, course outline]

evaluation

k-fold cross validation

From http://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/
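
A small sketch of how k-fold cross validation partitions the data, assuming a shuffled split into k folds of near-equal size (the function name, seed and sizes are illustrative assumptions):

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists; each sample is in the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=10, k=5))
for train, test in splits:
    print(len(train), "training samples,", len(test), "test samples")
```

The classifier is trained k times, each time on k-1 folds and tested on the remaining fold, and the k results are averaged.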

evaluation
confusion matrix

http://en.wikipedia.org/wiki/Confusion_matrix
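
A minimal sketch of building a confusion matrix, with rows for the actual class and columns for the predicted class. The labels and predictions below are invented for illustration and are not results from the project.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] = number of instances of labels[i] predicted as labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Invented example data, for illustration only.
labels = ["course", "faculty", "student"]
actual = ["course", "course", "faculty", "student", "student"]
predicted = ["course", "faculty", "faculty", "student", "course"]

matrix = confusion_matrix(actual, predicted, labels)
# Correct predictions lie on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", accuracy)
```

Off-diagonal cells show which categories are confused with each other, which a single accuracy figure hides.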

bibliography
B. Choi and Z. Yao: Web Page Classification, StudFuzz 180, 221–274 (2005)
Qi, Xiaoguang, and Brian D. Davison. "Web page classification: Features and
algorithms." ACM Computing Surveys (CSUR) 41.2 (2009): 12.
Golub, Koraljka, and Anders Ardö. "Importance of HTML
structural elements and metadata in automated subject classification." Research and
Advanced Technology for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification using URL
features." Proceedings of the 14th ACM international conference on Information
and knowledge management. ACM, 2005.
Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel approach for a Web
page classification." Proceedings of SAWM04 workshop, ECML2004. 2004.
