Clustering output of Apache Nutch using Apache Spark

Clustering the output of Apache Nutch using Apache Spark
Thamme Gowda N., Dr. Chris Mattmann
May 12, 2016. Vancouver, Canada

About
● ThammeGowda Narayanaswamy - TG in short - @thammegowda
○ Contributor to Apache Tika and Apache Nutch
○ Now - a grad student @ University of Southern California
○ Past - Technical Co-Founder @ Datoin - http://datoin.com
● Dr. Chris Mattmann @chrismattmann
○ Adj. Prof. and the director of IRDS group
@ University of Southern California, Los Angeles
○ Director @ Apache Software Foundation
○ Chief Architect, NASA JPL

Overview
● Problem Statement
● Clustering - a solution
● Structure and Style Similarity
● Shared Near Neighbor Clustering
● Scaling it up using Spark’s Distributed Matrices and GraphX
● A demo

Audience
● Anyone who crawls the web
● Anyone who extracts data from the web
● Anyone who filters web pages
● Anyone who would like to know about
○ web page structure and style similarity
○ shared near neighbor clustering

Problem Statement
● Scraping data from online marketplaces
● Start with homepage → categories → listing pages → actual stuff (detail page)

Sample set of web pages
credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov

[Figure: the sample pages above, annotated in three groups: USELESS; REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS; USEFUL FOR ANALYSIS]

Question : How do we solve this?
Answer : Cluster the web pages

Why Cluster?
● Separate the interesting web pages
○ Drop uninteresting/noisy web pages
○ Categorical treatment of clusters
● Extract Structured data using XPath
○ Automated extraction using alignment

Goal
● Group web pages that are similar
● Similar in terms of
○ CSS Styles
○ DOM Structure
● Toolkit for experimentation with various thresholds
○ % of similarity in style and/or structure
○ Nice visualizations

How do we cluster?
● Based on similarity between pages
● Semantic similarity
○ meaning of the web pages
● Syntactic similarity
○ Web page structure, css styles
● This session focuses on the syntactic aspect

Structural similarity
● Web pages are built with HTML
● HTML Doc → DOM tree
● a labeled ordered tree
● Structural similarity using tree edit distance (TED)
● Example DOM tree: HTML → [HEAD → [TITLE], BODY → [DIV, P]]

(Minimum) Tree Edit Distance
● Edit distance measure similar to the one for strings, but on hierarchical data instead of sequences
● Number of editing operations required to transform one tree into another
● Three basic editing operations: INSERT, REMOVE and REPLACE
● A useful measure to quantify how similar (or dissimilar) two trees are
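
The definition above can be illustrated with a small sketch. It is not an optimized algorithm: it is a plain memoized recursion over ordered forests with unit costs, practical only for small trees. The `tree` helper, the function names and the sample pages are hypothetical, chosen for illustration.

```python
from functools import lru_cache


def tree(label, *children):
    """Build an immutable ordered tree as (label, (child, child, ...))."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests (tuples of trees), unit costs."""
    if not f and not g:
        return 0
    if not g:
        # Only REMOVE is possible: delete the rightmost root of f,
        # which promotes its children into the forest.
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1
    if not f:
        # Only INSERT is possible: mirror image of the case above.
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    return min(
        forest_distance(f[:-1] + f_kids, g) + 1,  # REMOVE rightmost root of f
        forest_distance(f, g[:-1] + g_kids) + 1,  # INSERT rightmost root of g
        forest_distance(f[:-1], g[:-1])           # match the two rightmost roots:
        + forest_distance(f_kids, g_kids)         #   align their children
        + (0 if f_label == g_label else 1),       #   REPLACE if the labels differ
    )


def tree_edit_distance(a, b):
    return forest_distance((a,), (b,))


page_a = tree("html", tree("head", tree("title")), tree("body", tree("div"), tree("p")))
page_b = tree("html", tree("head", tree("title")), tree("body", tree("div")))
page_c = tree("html", tree("head", tree("title")), tree("body", tree("div"), tree("span")))

print(tree_edit_distance(page_a, page_a))  # identical trees
print(tree_edit_distance(page_a, page_b))  # differs by the <p> node
print(tree_edit_distance(page_a, page_c))  # <p> versus <span>
```

Memoizing on whole forests keeps the code short at the price of many subproblems, which is one reason real implementations use a dedicated dynamic program instead.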

Example: Tree Edit Distance*
● Edit operations
● Normalized distance
* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.

Style Similarity
● Have you noticed?
○ Similar web pages have similar CSS styles
● XPath : "//*[@class]/@class"
● Simple measure -
○ Jaccard Similarity on CSS class names
○ J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of class names on the two pages
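
A minimal sketch of the style measure, under stated assumptions: the slide queries class names with XPath, whereas this example swaps in Python's standard `html.parser` so that it runs without extra libraries. The names `ClassNameCollector`, `css_classes` and `jaccard_similarity`, and the two sample snippets, are hypothetical.

```python
from html.parser import HTMLParser


class ClassNameCollector(HTMLParser):
    """Collects every CSS class name found in a class="..." attribute."""

    def __init__(self):
        super().__init__()
        self.class_names = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # One attribute may carry several space-separated class names.
                self.class_names.update(value.split())


def css_classes(html):
    collector = ClassNameCollector()
    collector.feed(html)
    collector.close()
    return collector.class_names


def jaccard_similarity(a, b):
    if not a and not b:
        # Assumption: two pages without any class names count as identical in style.
        return 1.0
    return len(a & b) / len(a | b)


listing = '<div class="item row"><span class="price">10</span></div>'
detail = '<div class="item detail"><span class="price">10</span></div>'

# The pages share {item, price} out of {item, row, detail, price}.
print(jaccard_similarity(css_classes(listing), css_classes(detail)))
```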

Web pages consist of:
● HTML ✓
● CSS ✓
● JavaScript ×

Aggregating the Style and Structure
● StructuralSimilarity : derived from the normalized tree edit distance
● StyleSimilarity : Jaccard similarity on CSS class names
● Combine on a linear scale
○ Aggregated = k · Structural + (1 − k) · Style, with 0 ≤ k ≤ 1
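
The linear combination can be sketched as follows. The slides do not spell out how the tree edit distance is normalized, so `structural_similarity` below rests on an assumption: with unit costs the distance never exceeds the combined node count of the two trees, so dividing by that sum and subtracting from 1 gives a score in [0, 1]. All function names are hypothetical.

```python
def structural_similarity(edit_distance, size_a, size_b):
    """Turn a tree edit distance into a similarity in [0, 1].

    Assumption (not from the slides): normalize by size_a + size_b, the cost
    of deleting every node of one tree and inserting every node of the other.
    """
    total = size_a + size_b
    return 1.0 if total == 0 else 1.0 - edit_distance / total


def aggregate_similarity(structural, style, k=0.5):
    """Linear blend: k weights structure, (1 - k) weights style."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return k * structural + (1.0 - k) * style


# Two pages with 6 and 5 DOM nodes, one edit apart, sharing half of their classes.
structure = structural_similarity(edit_distance=1, size_a=6, size_b=5)
for k in (0.0, 0.5, 1.0):
    print(k, aggregate_similarity(structure, style=0.5, k=k))
```

Setting k to 1 ignores style entirely and setting it to 0 ignores structure, which makes k a convenient knob for the threshold experiments mentioned in the goals.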

Implementation
● Read Nutch’s Segments
○ sparkContext.sequenceFile(...)
● Filter web pages
○ Robust content type detection -- Tika
● Structural Similarity
○ HTML to DOM Tree -- NekoHTML
○ Tree Edit Distance -- Zhang and Shasha’s algorithm

Implementation …
● Style Similarity
○ Query CSS class names using XPath
● Similarity Matrix
○ rdd.cartesian(rdd) to get n × n cells
○ Spark’s Distributed (Coordinate) Matrix
● Persist the matrix for later experimentation with multiple thresholds
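
To make the coordinate layout concrete, here is a single-machine sketch in plain Python rather than Spark: `itertools.product` stands in for the cartesian product, and each cell is stored as a (row, column, value) triple, the same shape a coordinate matrix entry has. The function names and the three sample pages (given as sets of CSS class names) are hypothetical.

```python
from itertools import product


def similarity_entries(pages, similarity):
    """Yield one (row, col, value) cell for every ordered pair of pages."""
    indexed = list(enumerate(pages))
    for (i, a), (j, b) in product(indexed, indexed):
        yield (i, j, similarity(a, b))


def above_threshold(entries, threshold):
    """Keep off-diagonal cells whose similarity reaches the threshold."""
    return [(i, j, v) for i, j, v in entries if i != j and v >= threshold]


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


pages = [{"item", "row", "price"}, {"item", "detail", "price"}, {"nav", "footer"}]

# Compute the n * n cells once, then reuse them for several thresholds.
matrix = list(similarity_entries(pages, jaccard))
for threshold in (0.4, 0.8):
    print(threshold, above_threshold(matrix, threshold))
```

Keeping the computed cells around is what makes the threshold experiments cheap: the expensive pairwise comparison runs once, and only the filter is repeated.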

Clustering
● Shared Near Neighbor Clustering
○ Jarvis & Patrick, 1973*
● With improvements
○ Graph based Implementation
■ Spark GraphX for the win!
* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.

What’s good about this algorithm?
● What’s the difficulty with the most popular k-means?
○ Prior knowledge of clusters?
○ Mean/Average of documents in a cluster?
■ Average of DOM Trees?
■ Average of CSS styles?
○ Circular/Spherical/Globular shapes?
● Shared Near Neighbor Clustering
○ Similarity matrix - pluggable similarity measures - generic
○ Thresholds - absolute counts or percentage of match

Shared Near Neighbor Algorithm
“If two data points share a threshold number of
neighbors, then they must belong to the same
cluster”

Clustering Implementation
● Similarity Matrix to Graph
○ Clusters as nodes, similarity measure as edges
● Check for Similar neighbors
○ Filter on threshold and Merge
■ Immutable! - new graph for next iteration
○ Repeat
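
The steps above can be sketched on a single machine. This is a simplified, single-pass illustration and not the GraphX implementation from the talk: neighbors are picked by a similarity threshold, two nodes are merged when they are neighbors and share at least `min_shared` neighbors, and a union-find structure replaces the repeated graph rebuilds. The function name, parameters and sample entries are hypothetical.

```python
from itertools import combinations


def shared_near_neighbor_clusters(n, entries, sim_threshold, min_shared):
    """Cluster nodes 0..n-1 from (i, j, similarity) entries."""
    # Neighbor set of each node: itself plus every sufficiently similar node.
    neighbors = [{i} for i in range(n)]
    for i, j, value in entries:
        if i != j and value >= sim_threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        # Merge only neighbors that share enough of their neighborhood.
        if j in neighbors[i] and len(neighbors[i] & neighbors[j]) >= min_shared:
            parent[find(i)] = find(j)

    clusters = {}
    for node in range(n):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


# Two tight groups (0-1-2 and 3-4-5) joined by one weak link between 2 and 3.
entries = [(0, 1, 0.9), (0, 2, 0.9), (1, 2, 0.9),
           (3, 4, 0.9), (3, 5, 0.9), (4, 5, 0.9),
           (2, 3, 0.6)]

print(shared_near_neighbor_clusters(6, entries, sim_threshold=0.5, min_shared=3))
print(shared_near_neighbor_clusters(6, entries, sim_threshold=0.5, min_shared=2))
```

With the stricter `min_shared=3` the weak link is ignored because nodes 2 and 3 share only each other, while `min_shared=2` lets that link chain both groups into one cluster. This shows how sensitive the result is to the thresholds.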

Shared Near Neighbor Clustering on Apache Spark GraphX

Challenges
● Tree Edit Distance is very expensive

What’s ahead on the road?
● Integrate with Apache Nutch
● Auto Extraction
○ Unsupervised learning on the structure of pages to scrape the actual data of the web page
● Faster Tree Edit Distance
○ Maybe with approximation techniques

Summary
● Example Scenario
● Similarity measures
● Clustering as a solution
● Demo

Acknowledgements
● Dr. Chris Mattmann
○ My mentor
○ Professor, Director at IRDS @ USC - http://irds.usc.edu
○ Director, Apache Software Foundation
● DARPA Memex project

Thank You!
● Source Code
● Tutorial
● Follow up
○ Thamme Gowda - @thammegowda
○ Chris Mattmann - @chrismattmann
