IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA
Thamme Gowda
@thammegowda
Dr. Chris Mattmann
@chrismattmann
CLUSTERING WEB PAGES BASED ON STRUCTURE AND STYLE SIMILARITY
Information Retrieval and Data Science

OUTLINE
• Problem Statement
• Method Overview
• Steps
• Tree Edit Distance
• Style Similarity
• Shared Near Neighbor Clustering
• Evaluation
• Challenges

PROBLEM STATEMENT
• Scraping data from online marketplaces
• Start with the homepage → categories → listings → the actual content (detail pages)

SAMPLE WEB PAGES
Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
[Figure: eight sample web pages, numbered 1 to 8. Over successive slides, two pages are marked USELESS, three are marked CRAWLER: YES / ANALYSIS: NO, and three are marked USEFUL.]

METHOD OVERVIEW
CLUSTERING

• “task of grouping a set of objects in such a way that objects
in the same group are more similar (in some sense or the
other) to each other than to those in the other groups”
– Wikipedia
• There are many ways to achieve this.

HOW DO WE CLUSTER
• Based on similarity between pages
• Semantic similarity
• meaning of the web pages (keywords, topics,…)
• Syntactic similarity
• Web page structure, CSS styles
• This presentation focuses on the syntactic aspect

SIMILARITY CHECK
• HTML ✓
• CSS ✓
• JavaScript ×

METHOD : INPUT
WEB PAGES FROM A CRAWLER LIKE APACHE NUTCH

METHOD : STEP #1
STRUCTURAL SIMILARITY

• Web pages are built with HTML
• HTML document → DOM tree
• The DOM tree is a labeled, ordered tree
• Structural similarity is measured using tree edit distance (TED)
[Figure: example DOM tree with nodes HTML, HEAD, BODY, TITLE, DIV, P]
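
As an illustration of the HTML document → DOM tree step, here is a minimal sketch that uses only Python's standard `html.parser` module. The names `Node`, `DomTreeBuilder`, and `parse_dom`, and the nested-tuple representation `(label, children)`, are assumptions made for this example; they are not taken from the authors' code, and a production crawler would more likely rely on a full HTML5 parser.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag in HTML, so they cannot have children.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tree: a tag label plus an ordered list of children."""

    def __init__(self, label):
        self.label = label
        self.children = []

    def to_tuple(self):
        # Immutable, hashable form: (label, (child, child, ...)).
        return (self.label, tuple(c.to_tuple() for c in self.children))


class DomTreeBuilder(HTMLParser):
    """Builds a labeled ordered tree of tags; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break


def parse_dom(html):
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root.to_tuple()
```

Calling `parse_dom` on a page with a `head` containing a `title` and a `body` containing a `div` and a `p` yields the same shape as the tree in the figure, wrapped in a synthetic `#document` root.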

MINIMUM TREE EDIT DISTANCE
• An edit distance measure similar to the one for strings, but on hierarchical data instead of sequences
• The number of editing operations required to transform one tree into another
• Three basic editing operations: INSERT, REMOVE and REPLACE
• A useful measure to quantify how similar (or dissimilar) two trees are
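
To make the definition concrete, the following is a minimal sketch of ordered tree edit distance with unit costs for INSERT, REMOVE and REPLACE. It uses the plain memoized forest recursion, not the optimized Zhang-Shasha algorithm, so it is only suitable for small trees. The tuple representation `(label, children)` and the function names are assumptions for this example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        # Insert the rightmost root of g; its children join the forest.
        label, children = g[-1]
        return forest_distance(f, g[:-1] + children) + 1
    if not g:
        # Remove the rightmost root of f; its children join the forest.
        label, children = f[-1]
        return forest_distance(f[:-1] + children, g) + 1
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    replace_cost = 0 if v_label == w_label else 1
    return min(
        # REMOVE the rightmost root of f.
        forest_distance(f[:-1] + v_children, g) + 1,
        # INSERT the rightmost root of g.
        forest_distance(f, g[:-1] + w_children) + 1,
        # Match the two rightmost roots (REPLACE if labels differ).
        forest_distance(v_children, w_children)
        + forest_distance(f[:-1], g[:-1])
        + replace_cost,
    )


def tree_edit_distance(t1, t2):
    """Minimum number of unit-cost edits that turn tree t1 into tree t2."""
    return forest_distance((t1,), (t2,))
```

For two pages that differ only in one tag (for example a `p` replaced by a `div`), the distance is 1; for identical trees it is 0.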

MINIMUM TREE EDIT DISTANCE*
● Edit operations
● Normalized distance
* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
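
The slide does not say how the distance is normalized. One plausible choice, assumed here purely for illustration, is to divide by the total number of nodes in both trees, which is an upper bound on the unit-cost distance (remove every node of one tree, insert every node of the other). The function names and the `(label, children)` tuple representation are hypothetical.

```python
def tree_size(tree):
    """Number of nodes in a (label, children) tuple tree."""
    label, children = tree
    return 1 + sum(tree_size(child) for child in children)


def normalized_distance(distance, tree1, tree2):
    """Scale a raw edit distance into [0, 1] (assumed normalization)."""
    return distance / (tree_size(tree1) + tree_size(tree2))


def structural_similarity(distance, tree1, tree2):
    """Similarity in [0, 1]; 1.0 means the trees are identical."""
    return 1.0 - normalized_distance(distance, tree1, tree2)
```

Under this assumption, two trees of 6 and 5 nodes at distance 1 have a structural similarity of 10/11.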

METHOD : STEP #2
STYLE SIMILARITY

• Similar web pages have similar CSS styles
• XPath: "//*[@class]/@class"
• Simple measure: Jaccard similarity on CSS class names
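
A minimal sketch of this measure, using only the standard library: instead of evaluating the XPath, it walks the start tags with `html.parser` and collects every `class` attribute, which selects the same attributes. Splitting each attribute value into individual class names, and treating two pages without any classes as fully similar, are assumptions of this example; the function names are hypothetical.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects the set of CSS class names used anywhere in a page."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # class="a b" contributes two names: "a" and "b".
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # assumption: two pages with no classes count as identical
    return len(a & b) / len(a | b)


def style_similarity(html1, html2):
    return jaccard(css_classes(html1), css_classes(html2))
```

For example, pages using the class sets {ad, title, price} and {ad, price, big} share 2 of 4 distinct names, giving a style similarity of 0.5.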

METHOD : STEP #3
AGGREGATED = k × STRUCTURAL + (1 − k) × STYLE
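
The aggregation is a weighted average of the two similarities. A minimal sketch follows; the function name is hypothetical, and the default `k = 0.5` mirrors the equal weighting reported in the evaluation.

```python
def aggregate_similarity(structural, style, k=0.5):
    """Weighted combination: k * structural + (1 - k) * style."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie between 0 and 1")
    return k * structural + (1.0 - k) * style
```

With `k = 1` only the structure counts, and with `k = 0` only the style counts.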

METHOD : STEP #4
SIMILARITY MATRIX → CLUSTERING (SHARED NEAR NEIGHBOR) → CLUSTERS

SHARED NEAR NEIGHBOR (SNN) ALGORITHM
“If two data points share a threshold number of neighbors, then they must belong to the same cluster” *
* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.
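
The idea can be sketched as follows. This is a threshold-based variant written for illustration, not the original Jarvis-Patrick procedure (which uses k-nearest-neighbor lists) and not necessarily what the authors implemented. It assumes a symmetric similarity matrix, defines a page's neighbors as all pages at or above `sim_threshold`, and links two neighboring pages when the shared fraction of their neighbor sets (intersection over union) reaches `shared_threshold`. All names and both definitions are assumptions.

```python
def snn_clusters(similarity, sim_threshold=0.9, shared_threshold=0.9):
    """Group indices 0..n-1 by shared near neighbors (illustrative variant)."""
    n = len(similarity)
    # Neighbor set of each page: itself plus every sufficiently similar page.
    neighbors = [
        {j for j in range(n) if j == i or similarity[i][j] >= sim_threshold}
        for i in range(n)
    ]
    parent = list(range(n))  # union-find forest over page indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if j not in neighbors[i]:
                continue  # only neighboring pages can be linked
            shared = len(neighbors[i] & neighbors[j])
            total = len(neighbors[i] | neighbors[j])
            if shared / total >= shared_threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Pages that are not linked to any other page end up in singleton clusters, so no cluster count has to be chosen in advance.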

WHAT’S GOOD ABOUT SNN ALGORITHM
• Guessing k in k-means is hard
• A more meaningful question is “Make clusters of 90% similarity” instead of “Make 10 clusters”
• k-means needs the mean / average of the documents in a cluster, but:
• What is the average of DOM trees?
• What is the average of CSS styles?
• Must clusters have circular / spherical / globular shapes?


METHOD : LAST STEP*
CLUSTERS → LABELING → CATEGORIES / USABLE CLUSTERS
* Human intervention: this step requires domain expertise

SOME APPLICATIONS?
• Separate the interesting web pages
• Drop uninteresting / noisy web pages
• Categorical treatment of clusters
• Extract structured data using XPath
• Automated extraction using alignment

WORKFLOW: PART #1

WORKFLOW: PART #2

EVALUATION
DATASET:
1310 web pages from http://armslist.com
• 987 ad detail pages
• 311 ad listing pages
• 12 others: index, contact, FAQs, etc.
PARAMETERS:
• 50% weight for CSS style, 50% weight for HTML structure
• Series of experiments at various thresholds: 85%, 90%, 95%

EVALUATION
PARAMETERS:
• SIMILARITY = 90%
• SHARED NEIGHBORS = 90%

EVALUATION
PARAMETERS:
• SIMILARITY = 95%

EVALUATION
PARAMETERS:
• SIMILARITY = 85%

CHALLENGES
• TED is very expensive
• Zhang-Shasha’s TED runs in O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)})
• That’s O(n^4) in the worst case
• Approx. 1000 HTML tags
• That’s O(10^12)
[Figure: time complexity plotted against number of HTML tags]
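
The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical, and the depth and leaf counts in the usage note are made-up illustrative values, not measurements from the dataset.

```python
def zhang_shasha_bound(size1, depth1, leaves1, size2, depth2, leaves2):
    """Evaluate |T1| * |T2| * min(depth, leaves) of T1 * min(depth, leaves) of T2."""
    return size1 * size2 * min(depth1, leaves1) * min(depth2, leaves2)


def worst_case_bound(n):
    """The n^4 worst case quoted for two trees of n nodes each."""
    return n ** 4


def pairwise_comparisons(num_pages):
    """Number of page pairs needed to fill a symmetric similarity matrix."""
    return num_pages * (num_pages - 1) // 2
```

`worst_case_bound(1000)` gives 10^12, and `pairwise_comparisons(1310)` gives 857,395 pairs for a collection of 1310 pages. For two hypothetical pages of 1000 tags, depth 15 and 400 leaves each, `zhang_shasha_bound` evaluates to 225,000,000, which suggests that shallow DOM trees sit well below the n^4 worst case.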

ACKNOWLEDGMENTS
DARPA MEMEX
* Photo credits: http://memex.jpl.nasa.gov/

THANK YOU
• Source code: https://github.com/USCDataScience/autoextractor
• Tutorial: https://git.io/vwS69
• Follow up
• Thamme Gowda - @thammegowda
• Chris Mattmann - @chrismattmann
