The structural similarity of HTML pages is measured using tree edit distance (TED) on their DOM trees. Stylistic similarity is measured using Jaccard similarity on CSS class names. The two measures are combined into an aggregated similarity, and a clustering method is then applied to that aggregate to group the documents.
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
1. July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA
Thamme Gowda
@thammegowda
Dr. Chris Mattmann
@chrismattmann
CLUSTERING WEB PAGES BASED ON
STRUCTURE AND STYLE SIMILARITY
Information Retrieval
and Data Science
2. OUTLINE
• Problem Statement
• Method Overview
• Steps
• Tree Edit Distance
• Style Similarity
• Shared Near Neighbor Clustering
• Evaluation
• Challenges
9. CLUSTERING
• “task of grouping a set of objects in such a way that objects in the same group are more similar (in some sense or the other) to each other than to those in the other groups” - Wikipedia
• There are many ways to achieve this.
10. HOW DO WE CLUSTER
• Based on similarity between pages
• Semantic similarity
• meaning of the web pages (keywords, topics,…)
• Syntactic similarity
• Web page structure, CSS styles
• This presentation focuses on the syntactic aspect
11. SIMILARITY CHECK
• HTML ✓
• CSS ✓
• JavaScript ×
13. METHOD : STEP #1
WEB PAGES FROM CRAWLER LIKE APACHE NUTCH → STRUCTURAL SIMILARITY
14. STRUCTURAL SIMILARITY
• Web pages are built with HTML
• HTML Doc → DOM tree
• a labeled ordered tree
• Structural similarity using tree edit distance (TED)
[Figure: example DOM tree with root HTML, children HEAD and BODY, and nodes TITLE, DIV, P below them]
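As a rough illustration of the "HTML Doc → DOM tree" step, the sketch below builds a labeled ordered tree from an HTML string using only Python's standard-library `html.parser`. The names `DomTreeBuilder`, `dom_tree`, and the `#document` root label are hypothetical, and the tag-matching logic is a simplification, not a full HTML5 tree builder and not necessarily what the authors used.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (HTML void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class DomTreeBuilder(HTMLParser):
    """Collects tag names into a nested [label, children] structure."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]   # hypothetical synthetic root
        self.stack = [self.root]        # path of currently open elements

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: always a leaf.
        self.stack[-1][1].append([tag, []])

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

def freeze(node):
    """Convert nested lists into hashable (label, children) tuples."""
    label, children = node
    return (label, tuple(freeze(c) for c in children))

def dom_tree(html):
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return freeze(builder.root)

example = dom_tree("<html><head><title>T</title></head>"
                   "<body><div></div><p></p></body></html>")
```

Text nodes and attributes are ignored here, so only the tag hierarchy and the order of children survive, which is the information a structural comparison needs.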
15. MINIMUM TREE EDIT DISTANCE
• Edit distance measure similar to the one for strings, but on hierarchical data instead of sequences
• Minimum number of editing operations required to transform one tree into another
• Three basic editing operations: INSERT, REMOVE and REPLACE
• A useful measure to quantify how similar (or dissimilar) two trees are
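The definition above can be made concrete with a small sketch. This is not the Zhang-Shasha algorithm cited in the deck; it is a plain memoized recursion over ordered forests with unit costs for INSERT, REMOVE and REPLACE, practical only for small trees. Trees are assumed to be `(label, children)` tuples. The normalization by the total node count is also an assumption chosen for illustration, since the deck's own formula did not survive extraction.

```python
from functools import lru_cache

def tree_edit_distance(t1, t2):
    """Unit-cost edit distance between two ordered labeled trees."""

    @lru_cache(maxsize=None)
    def forest_dist(f1, f2):
        if not f1 and not f2:
            return 0
        if not f1:
            _, kids = f2[-1]
            # INSERT the rightmost root of f2; its children are promoted.
            return forest_dist(f1, f2[:-1] + kids) + 1
        if not f2:
            _, kids = f1[-1]
            # REMOVE the rightmost root of f1; its children are promoted.
            return forest_dist(f1[:-1] + kids, f2) + 1
        (l1, k1), (l2, k2) = f1[-1], f2[-1]
        return min(
            forest_dist(f1[:-1] + k1, f2) + 1,            # REMOVE
            forest_dist(f1, f2[:-1] + k2) + 1,            # INSERT
            forest_dist(k1, k2)                           # match the two roots,
            + forest_dist(f1[:-1], f2[:-1])               # then the rest,
            + (l1 != l2),                                 # REPLACE if labels differ
        )

    return forest_dist((t1,), (t2,))

def size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(child) for child in tree[1])

def normalized_distance(t1, t2):
    # Assumption: divide by |T1| + |T2|, the cost of removing every node of
    # one tree and inserting every node of the other, so the result is in [0, 1].
    return tree_edit_distance(t1, t2) / (size(t1) + size(t2))

page_a = ("html", (("head", (("title", ()),)),
                   ("body", (("div", ()), ("p", ())))))
page_b = ("html", (("head", (("title", ()),)),
                   ("body", (("div", ()),))))          # <p> removed
page_c = ("html", (("head", (("title", ()),)),
                   ("body", (("span", ()), ("p", ())))))  # <div> replaced

print(tree_edit_distance(page_a, page_b))
print(tree_edit_distance(page_a, page_c))
print(normalized_distance(page_a, page_b))
```

Under this normalization, a structural similarity in [0, 1] can be taken as `1 - normalized_distance(t1, t2)`.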
16. MINIMUM TREE EDIT DISTANCE*
● Edit operations
● Normalized distance
* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
17. METHOD : STEP #2
WEB PAGES FROM CRAWLER LIKE APACHE NUTCH → STYLE SIMILARITY
18. STYLE SIMILARITY
• Similar web pages have similar CSS styles
• XPath: "//*[@class]/@class"
• Simple measure: Jaccard similarity on CSS class names
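A minimal sketch of the style measure follows. Instead of evaluating the XPath expression, which would need an XPath-capable library, it collects `class` attribute values with the standard-library `html.parser`; the intent is the same set of CSS class names. The names `ClassCollector`, `css_classes`, and `jaccard` are hypothetical, and returning 1.0 for two pages with no classes at all is an assumption.

```python
from html.parser import HTMLParser

class ClassCollector(HTMLParser):
    """Gathers every CSS class name used anywhere in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # class="a b" names two classes, so split on whitespace.
                self.classes.update(value.split())

def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # assumption: two class-free pages count as identical in style
    return len(a & b) / len(a | b)

page_a = '<div class="ad title"><p class="price">x</p></div>'
page_b = '<div class="ad"><span class="price big">y</span></div>'

style_similarity = jaccard(css_classes(page_a), css_classes(page_b))
print(style_similarity)  # 2 shared classes out of 4 distinct -> 0.5
```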
19. METHOD : STEP #3
AGGREGATED = k · STRUCTURAL + (1 - k) · STYLE
STRUCTURAL
STYLE
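The weighted combination can be sketched in a few lines, assuming both inputs are already similarities in [0, 1]. The function name `aggregated_similarity` is hypothetical; the default `k = 0.5` mirrors the 50/50 weighting reported in the evaluation.

```python
def aggregated_similarity(structural, style, k=0.5):
    """Weighted mix: k * structural + (1 - k) * style, with 0 <= k <= 1."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return k * structural + (1.0 - k) * style

# Example values are made up for illustration only.
print(aggregated_similarity(0.8, 0.6))         # equal weights
print(aggregated_similarity(0.8, 0.6, k=1.0))  # structure only
```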
20. METHOD : STEP #4
SIMILARITY MATRIX → CLUSTERING (SHARED NEAR NEIGHBOR) → CLUSTERS
21. SHARED NEAR NEIGHBOR (SNN) ALGORITHM
“If two data points share a threshold number of neighbors, then they must belong to the same cluster” *
* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.
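One way to read the quoted rule is sketched below. It is a simplified variant, not the exact Jarvis-Patrick procedure, which works on k-nearest-neighbor lists. Here a neighbor is any page whose aggregated similarity meets a threshold, two pages are linked when they are neighbors of each other and share at least `min_shared` other neighbors, and clusters are the connected components of those links. The function name, parameter names, and the example matrix are all hypothetical.

```python
def snn_clusters(sim, sim_threshold=0.9, min_shared=1):
    """Cluster items from a symmetric similarity matrix (list of lists)."""
    n = len(sim)
    # Neighbors of i: every other item at least sim_threshold similar to i.
    neighbors = [
        {j for j in range(n) if j != i and sim[i][j] >= sim_threshold}
        for i in range(n)
    ]

    parent = list(range(n))  # union-find forest for connected components

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbors[i] and i in neighbors[j]
            shared = (neighbors[i] & neighbors[j]) - {i, j}
            if mutual and len(shared) >= min_shared:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Toy matrix: pages 0-2 resemble each other, pages 3-4 resemble each other.
sim = [
    [1.00, 0.95, 0.95, 0.10, 0.10],
    [0.95, 1.00, 0.95, 0.10, 0.10],
    [0.95, 0.95, 1.00, 0.10, 0.10],
    [0.10, 0.10, 0.10, 1.00, 0.95],
    [0.10, 0.10, 0.10, 0.95, 1.00],
]

print(snn_clusters(sim, sim_threshold=0.9, min_shared=1))  # [[0, 1, 2], [3], [4]]
print(snn_clusters(sim, sim_threshold=0.9, min_shared=0))  # [[0, 1, 2], [3, 4]]
```

With `min_shared=1`, pages 3 and 4 stay apart because they have no third page in common, which shows how the shared-neighbor requirement is stricter than a plain similarity cutoff.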
22. WHAT’S GOOD ABOUT SNN ALGORITHM
• Guessing k in k-means is hard
• A more meaningful request: “Make clusters of 90% similarity” instead of “Make 10 clusters”
• No mean / average of the documents in a cluster is needed
• What would the average of DOM trees be?
• What would the average of CSS styles be?
• No assumption of circular / spherical / globular cluster shapes
24. METHOD : LAST STEP*
CLUSTERS → LABELING → CATEGORIES / USABLE CLUSTERS
* HUMAN INTERVENTION - THIS STEP REQUIRES DOMAIN EXPERTISE
25. SOME APPLICATIONS?
• Separate the interesting web pages?
• Drop uninteresting/noisy web pages
• Categorical treatment of clusters
• Extract Structured data using XPath
• Automated extraction using alignment
28. EVALUATION
DATASET:
1310 web pages from http://armslist.com
• 987 ad detail pages
• 311 ad listing pages
• 12 others: index, contact, FAQs, etc.
PARAMETERS:
• 50% weight for CSS style, 50% weight for HTML structure
• Series of experiments with various thresholds: 85%, 90%, 95%
32. CHALLENGES
• TED is very expensive
• Zhang-Shasha’s TED
• O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)})
• That’s O(n⁴) in the worst case
• Approx. 1000 HTML tags per page
• That’s on the order of 10¹² operations per pair of pages
[Figure: time complexity vs. number of HTML tags]
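The arithmetic behind this challenge can be spelled out as a back-of-the-envelope estimate. It combines the worst-case O(n⁴) bound with the figures quoted in the deck (about 1000 tags per page, 1310 pages); it is an upper-bound illustration, not a measured running time, and the variable names are hypothetical.

```python
from math import comb

tags_per_page = 1000   # approximate page size quoted in the deck
pages = 1310           # size of the evaluation dataset

ops_per_pair = tags_per_page ** 4   # worst-case n^4 bound for one comparison
page_pairs = comb(pages, 2)         # every unordered pair of pages
total_ops = ops_per_pair * page_pairs

print(ops_per_pair)   # 1000000000000, i.e. 10^12
print(page_pairs)     # 857395
print(f"{total_ops:.2e}")
```

Filling the full similarity matrix multiplies the per-pair cost by the number of page pairs, which is why the structural step dominates the pipeline's cost.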