Clustering the output of Apache Nutch
using Apache Spark
Thamme Gowda N., Dr. Chris Mattmann
May 12, 2016. Vancouver, Canada
1
About
● ThammeGowda Narayanaswamy - TG in short - @thammegowda
○ Contributor to Apache Tika and Apache Nutch
○ Now - a grad student @ University of Southern California
○ Past - Technical Co-Founder @ Datoin - http://datoin.com
● Dr. Chris Mattmann @chrismattmann
○ Adj. Prof. and the director of IRDS group
@ University of Southern California, Los Angeles
○ Director @ Apache Software Foundation
○ Chief Architect, NASA JPL
2
Overview
● Problem Statement
● Clustering - a solution
● Structure and Style Similarity
● Shared Near Neighbor Clustering
● Scaling it up using Spark’s Distributed Matrices and
GraphX
● A demo
3
Audience
● Anyone who crawls the web
● Anyone who extracts data from the web
● Anyone who filters web pages
● Anyone who would like to know about
○ web page structure and style similarity
○ shared near neighbor clustering
4
Problem Statement
● Scraping data from online marketplaces
● Start with the homepage → categories → listing pages → actual content (detail pages)
5
Sample set of web pages
[Figure: screenshots of sample crawled web pages, labeled step by step as "USELESS", "REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS", or "USEFUL FOR ANALYSIS"]
credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
9
Question: How do we solve this?
Answer: Cluster the web pages
10
Why Cluster?
● Separate the interesting web pages
○ Drop uninteresting/noisy web pages
○ Categorical treatment of clusters
● Extract Structured data using XPath
○ Automated extraction using alignment
11
Goal
● Group web pages that are similar
● Similar in terms of
○ CSS Styles
○ DOM Structure
● Toolkit for experimentation with various thresholds
○ % of similarity in style and/or structure
○ Nice visualizations
12
How do we cluster?
● Based on similarity between pages
● Semantic similarity
○ meaning of the web pages
● Syntactic similarity
○ Web page structure, CSS styles
● This session focuses on the syntactic aspect
13
Structural similarity
● Web pages are built with HTML
● HTML document → DOM tree
● The DOM is a labeled, ordered tree
● Structural similarity is measured using tree edit distance (TED)
[Figure: example DOM tree with the nodes HTML, HEAD, BODY, TITLE, DIV, P]
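As a rough illustration of the HTML → DOM tree step, the sketch below builds a labeled, ordered tree with Python's standard `html.parser`. It is only a stand-in for a real lenient HTML parser. The names `TreeBuilder`, `dom_tree`, the `#document` root label, and the `(label, children)` tuple layout are assumptions made for this example and are not taken from the talk.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (HTML "void" elements).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class TreeBuilder(HTMLParser):
    """Collects element tags into a nested [label, children] structure."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]   # hypothetical synthetic root
        self.stack = [self.root]        # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag such as <br/>: add a leaf, do not open it.
        self.stack[-1][1].append([tag, []])

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

def freeze(node):
    """Convert the mutable lists into hashable (label, children) tuples."""
    return (node[0], tuple(freeze(child) for child in node[1]))

def dom_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return freeze(builder.root)
```

Text nodes and attributes are dropped on purpose here, since only the element structure matters for a structural comparison.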
14
(Minimum) Tree Edit Distance
● An edit distance measure similar to the one for strings, but on hierarchical data instead of sequences
● The number of editing operations required to transform one tree into another
● Three basic editing operations: INSERT, REMOVE and REPLACE
● A useful measure to quantify how similar (or dissimilar) two trees are
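To make the definition concrete, here is a minimal sketch of ordered tree edit distance with unit costs, written as a memoized recursion over forests. This is not the Zhang-Shasha algorithm (which computes the same distance far more efficiently) and is only suitable for small trees. The `(label, children)` tuple layout and the normalization by the total node count are assumptions for illustration. The talk mentions a normalized distance but its exact formula is not given here.

```python
from functools import lru_cache

def node(label, *children):
    """Hypothetical helper: a tree is a (label, children) tuple."""
    return (label, tuple(children))

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not g:  # only REMOVE operations are left
        _, v_children = f[-1]
        return forest_distance(f[:-1] + v_children, ()) + 1
    if not f:  # only INSERT operations are left
        _, w_children = g[-1]
        return forest_distance((), g[:-1] + w_children) + 1
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # REMOVE the rightmost root of f; its children move up one level.
    remove = forest_distance(f[:-1] + v_children, g) + 1
    # INSERT the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, REPLACE the label if it differs.
    replace = (forest_distance(v_children, w_children)
               + forest_distance(f[:-1], g[:-1])
               + (0 if v_label == w_label else 1))
    return min(remove, insert, replace)

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

def size(tree):
    return 1 + sum(size(child) for child in tree[1])

def normalized_distance(t1, t2):
    # Assumption: divide by the total node count, so the result lies in [0, 1].
    return tree_edit_distance(t1, t2) / (size(t1) + size(t2))
```

For example, removing a single `P` leaf from a page, or relabeling one `DIV` as `SPAN`, each gives a distance of 1.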
15
Example: Tree Edit Distance*
● Edit operations
● Normalized distance
* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
16
Style Similarity
● Have you noticed?
○ Similar web pages have similar CSS styles
● XPath: "//*[@class]/@class"
● Simple measure:
○ Jaccard similarity on CSS class names
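The style measure can be sketched as follows: collect the CSS class names of each page, then take the Jaccard similarity |A ∩ B| / |A ∪ B| of the two sets. This example uses Python's standard `html.parser` in place of the XPath query, splits each `class` attribute on whitespace, and treats two pages with no classes as fully similar. All three choices, and the function names, are assumptions for illustration.

```python
from html.parser import HTMLParser

class ClassCollector(HTMLParser):
    """Gathers every CSS class name that appears in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())

def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes

def jaccard_similarity(a, b):
    """|a ∩ b| / |a ∪ b| for two sets."""
    if not a and not b:
        return 1.0  # assumption: two style-less pages count as identical
    return len(a & b) / len(a | b)

page1 = '<div class="item price"><p class="title">x</p></div>'
page2 = '<div class="item"><span class="title ad">y</span></div>'
similarity = jaccard_similarity(css_classes(page1), css_classes(page2))
```

Here the two pages share `item` and `title` out of four distinct class names, so `similarity` is 0.5.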
17
Web pages consist of:
● HTML ✓
● CSS ✓
● JavaScript ×
18
Aggregating the Style and Structure
● StructuralSimilarity: based on the normalized tree edit distance
● StyleSimilarity: Jaccard similarity on CSS class names
● Combine on a linear scale
○ Aggregated = k · Structural + (1 − k) · Style
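A small worked example of the linear combination, assuming both inputs are already similarities in [0, 1] (for instance, one minus a normalized distance) and that k is a weight in [0, 1]. The function name `aggregate` is hypothetical.

```python
def aggregate(structural, style, k):
    """Weighted combination: k * structural + (1 - k) * style."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1.0 - k) * style

# With equal weights, structure 0.8 and style 0.5 combine to about 0.65.
equal_weights = aggregate(0.8, 0.5, k=0.5)
# k = 1 uses structure only, k = 0 uses style only.
structure_only = aggregate(0.8, 0.5, k=1.0)
style_only = aggregate(0.8, 0.5, k=0.0)
```

Keeping k as a parameter fits the goal of experimenting with different weightings of style and structure.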
19
Implementation
● Read Nutch's segments
○ sparkContext.sequenceFile(...)
● Filter web pages
○ Robust content type detection -- Tika
● Structural similarity
○ HTML to DOM tree -- NekoHTML
○ Tree edit distance -- Zhang-Shasha algorithm
21
Implementation …
● Style similarity
○ Query CSS class names using XPath
● Similarity matrix
○ rdd.cartesian(rdd) to get n × n cells
○ Spark's distributed CoordinateMatrix
● Persist the matrix for later experimentation with multiple thresholds
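The n × n similarity matrix can be pictured on a single machine as a list of `(i, j, value)` entries, which is the same shape as the entries of a coordinate matrix. The sketch below uses `itertools.product` as a local stand-in for the distributed cartesian product. It does not use Spark, and the function names and the toy similarity measure are assumptions for illustration.

```python
from itertools import product

def similarity_entries(pages, similarity):
    """Return one (i, j, value) entry per ordered pair of pages."""
    indexed = list(enumerate(pages))
    return [(i, j, similarity(a, b))
            for (i, a), (j, b) in product(indexed, repeat=2)]

def pairs_above(entries, threshold):
    """Reuse stored entries to pick the pairs that pass a threshold."""
    return [(i, j) for i, j, value in entries if i < j and value >= threshold]

# Toy input: each page is represented by its set of CSS class names.
pages = [{"item", "price"}, {"item"}, {"footer"}]
overlap = lambda a, b: len(a & b) / len(a | b)

entries = similarity_entries(pages, overlap)
loose = pairs_above(entries, 0.4)
strict = pairs_above(entries, 0.9)
```

Because the entries are computed once and kept, trying a different threshold only needs another cheap filter pass, which is the point of persisting the matrix.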
22
Clustering
● Shared Near Neighbor Clustering*
○ Jarvis & Patrick, 1973
● With improvements
○ Graph-based implementation
■ Spark GraphX for the win!
* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.
23
What’s good about this algorithm?
● What’s the difficulty with the most popular k-means?
○ Prior knowledge of clusters?
○ Mean/Average of documents in a cluster?
■ Average of DOM Trees?
■ Average of CSS styles?
○ Circular/Spherical/Globular shapes?
● Shared Near Neighbor Clustering
○ Similarity matrix: pluggable similarity measures, generic
○ Thresholds: absolute numbers or percentage of match
24
Shared Near Neighbor Algorithm
“If two data points share a threshold number of
neighbors, then they must belong to the same
cluster”
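The rule quoted above can be sketched as a simple check on a similarity matrix. In this example a point's neighbors are the other points whose similarity reaches `min_similarity`, and two points pass when they share at least `min_shared` such neighbors. Both parameter names and the threshold-based definition of "neighbor" are assumptions for illustration. The original Jarvis-Patrick formulation works with k-nearest-neighbor lists instead.

```python
def neighbors(sim, i, min_similarity):
    """Indices of all other points that are similar enough to point i."""
    return {j for j, value in enumerate(sim[i])
            if j != i and value >= min_similarity}

def share_enough_neighbors(sim, i, j, min_similarity, min_shared):
    """True when points i and j have at least min_shared common neighbors."""
    shared = neighbors(sim, i, min_similarity) & neighbors(sim, j, min_similarity)
    return len(shared) >= min_shared

# Toy symmetric similarity matrix: points 0, 1, 2 are close, point 3 is not.
sim = [
    [1.00, 0.90, 0.80, 0.10],
    [0.90, 1.00, 0.85, 0.20],
    [0.80, 0.85, 1.00, 0.10],
    [0.10, 0.20, 0.10, 1.00],
]
close_pair = share_enough_neighbors(sim, 0, 1, min_similarity=0.7, min_shared=1)
far_pair = share_enough_neighbors(sim, 0, 3, min_similarity=0.7, min_shared=1)
```

Points 0 and 1 both have point 2 as a neighbor, so `close_pair` is true, while point 3 has no neighbors at this threshold and `far_pair` is false.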
25
Clustering Implementation
● Similarity matrix to graph
○ Clusters as nodes, similarity measures as edges
● Check for similar neighbors
○ Filter on threshold and merge
■ Immutable! A new graph is built for the next iteration
○ Repeat
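The graph-based steps above might look roughly like this on a single machine. The sketch takes an adjacency map (edges already filtered by a similarity threshold), merges every connected pair that shares at least `min_shared` neighbors, and reads off the clusters with a union-find structure. It does a single merge pass rather than rebuilding the graph and repeating, and it does not use GraphX. The function name, the adjacency layout, and the choice to test only pairs joined by an edge are assumptions for illustration.

```python
def snn_clusters(adjacency, min_shared):
    """Group nodes whose linked pairs share at least min_shared neighbors."""
    parent = {n: n for n in adjacency}

    def find(n):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a in adjacency:
        for b in adjacency[a]:
            # Visit each undirected edge once, then apply the shared-neighbor test.
            if a < b and len(adjacency[a] & adjacency[b]) >= min_shared:
                parent[find(a)] = find(b)

    clusters = {}
    for n in adjacency:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(members) for members in clusters.values())

# Toy graph: a, b, c form a triangle; d and e are linked but share no neighbor.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
clusters = snn_clusters(graph, min_shared=1)
```

With `min_shared=1` the triangle collapses into one cluster, while `d`, `e`, and `f` stay on their own.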
26
Shared Near Neighbor Clustering on
Apache Spark GraphX
27
Challenges
● Tree edit distance is very expensive, and it is computed for every pair of pages (n × n cells)
28
What’s ahead on the road?
● Integrate with Apache Nutch
● Auto extraction
○ Unsupervised learning on the structure of pages to scrape the actual data of the web page
● Faster tree edit distance
○ Possibly with approximation techniques
29
Demo
30
Summary
● Example Scenario
● Similarity measures
● Clustering as a solution
● Demo
31
Acknowledgements
● Dr. Chris Mattmann
○ My mentor
○ Professor, Director at IRDS @ USC - http://irds.usc.edu
○ Director, Apache Software Foundation
● DARPA Memex project
32
Thank You!
● Source Code
● Tutorial
● Follow up
○ Thamme Gowda - @thammegowda
○ Chris Mattmann - @chrismattmann
33

More Related Content

What's hot

Deriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataDeriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF Data
Graph-TA
 
Graph database
Graph database Graph database
Graph database
Shruti Arya
 
Graph Database
Graph DatabaseGraph Database
Graph Database
Richard Kuo
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archive
Lewis Crawford
 
Linked Open Data and DANS
Linked Open Data and DANSLinked Open Data and DANS
Linked Open Data and DANS
vty
 
20170501 Distributed Network of Digital Heritage Information
20170501  Distributed Network of Digital Heritage Information20170501  Distributed Network of Digital Heritage Information
20170501 Distributed Network of Digital Heritage Information
Enno Meijers
 
Open data easy, explicit and fast
Open data easy, explicit and fastOpen data easy, explicit and fast
Open data easy, explicit and fast
MetaSolutions AB
 
Scripting User Contributed Interlinking
Scripting User Contributed InterlinkingScripting User Contributed Interlinking
Scripting User Contributed Interlinking
whalb
 
Pandas
PandasPandas
Publishing Linked Data using Schema.org
Publishing Linked Data using Schema.orgPublishing Linked Data using Schema.org
Publishing Linked Data using Schema.org
DESTIN-Informatique.com
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
Ernesto Reig
 
Linked data-tooling-xml
Linked data-tooling-xmlLinked data-tooling-xml
Linked data-tooling-xml
Felix Sasaki
 
Making social science more reproducible by encapsulating access to linked data
Making social science more reproducible by encapsulating access to linked dataMaking social science more reproducible by encapsulating access to linked data
Making social science more reproducible by encapsulating access to linked data
Albert Meroño-Peñuela
 
Providing Linked Data
Providing Linked DataProviding Linked Data
Providing Linked Data
EUCLID project
 
DBPedia-past-present-future
DBPedia-past-present-futureDBPedia-past-present-future
DBPedia-past-present-future
Data Science Society
 
NO SQL Databases, Big Data and the cloud
NO SQL Databases, Big Data and the cloudNO SQL Databases, Big Data and the cloud
NO SQL Databases, Big Data and the cloud
Manu Cohen-Yashar
 
Discovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data PortalsDiscovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data Portals
Peter Haase
 
Semantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologistsSemantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologists
dgarijo
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked Data
EUCLID project
 
Improvement of no sql technology for relational databases v2
Improvement of no sql technology for relational databases v2Improvement of no sql technology for relational databases v2
Improvement of no sql technology for relational databases v2
Tsendsuren Munkhdalai
 

What's hot (20)

Deriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataDeriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF Data
 
Graph database
Graph database Graph database
Graph database
 
Graph Database
Graph DatabaseGraph Database
Graph Database
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archive
 
Linked Open Data and DANS
Linked Open Data and DANSLinked Open Data and DANS
Linked Open Data and DANS
 
20170501 Distributed Network of Digital Heritage Information
20170501  Distributed Network of Digital Heritage Information20170501  Distributed Network of Digital Heritage Information
20170501 Distributed Network of Digital Heritage Information
 
Open data easy, explicit and fast
Open data easy, explicit and fastOpen data easy, explicit and fast
Open data easy, explicit and fast
 
Scripting User Contributed Interlinking
Scripting User Contributed InterlinkingScripting User Contributed Interlinking
Scripting User Contributed Interlinking
 
Pandas
PandasPandas
Pandas
 
Publishing Linked Data using Schema.org
Publishing Linked Data using Schema.orgPublishing Linked Data using Schema.org
Publishing Linked Data using Schema.org
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
 
Linked data-tooling-xml
Linked data-tooling-xmlLinked data-tooling-xml
Linked data-tooling-xml
 
Making social science more reproducible by encapsulating access to linked data
Making social science more reproducible by encapsulating access to linked dataMaking social science more reproducible by encapsulating access to linked data
Making social science more reproducible by encapsulating access to linked data
 
Providing Linked Data
Providing Linked DataProviding Linked Data
Providing Linked Data
 
DBPedia-past-present-future
DBPedia-past-present-futureDBPedia-past-present-future
DBPedia-past-present-future
 
NO SQL Databases, Big Data and the cloud
NO SQL Databases, Big Data and the cloudNO SQL Databases, Big Data and the cloud
NO SQL Databases, Big Data and the cloud
 
Discovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data PortalsDiscovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data Portals
 
Semantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologistsSemantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologists
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked Data
 
Improvement of no sql technology for relational databases v2
Improvement of no sql technology for relational databases v2Improvement of no sql technology for relational databases v2
Improvement of no sql technology for relational databases v2
 

Similar to Clustering output of Apache Nutch using Apache Spark

How to get started in Big Data for master's students
How to get started in Big Data for master's studentsHow to get started in Big Data for master's students
How to get started in Big Data for master's students
Mohamed Nadjib MAMI
 
OpenHPI - Parallel Programming Concepts - Week 6
OpenHPI - Parallel Programming Concepts - Week 6OpenHPI - Parallel Programming Concepts - Week 6
OpenHPI - Parallel Programming Concepts - Week 6
Peter Tröger
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
Daniel Marcous
 
Brett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine - Graph Databases and Neo4jBrett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Evan Casey
 
Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architecture
Markus Klems
 
Distributed Decision Tree Induction
Distributed Decision Tree InductionDistributed Decision Tree Induction
Distributed Decision Tree Induction
gregoryg
 
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Pramati Technologies
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Anant Corporation
 
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
ArangoDB Database
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
Demi Ben-Ari
 
A Kaggle Talk
A Kaggle TalkA Kaggle Talk
A Kaggle Talk
Lex Toumbourou
 
Machine Learning + Graph Databases for Better Recommendations
Machine Learning + Graph Databases for Better RecommendationsMachine Learning + Graph Databases for Better Recommendations
Machine Learning + Graph Databases for Better Recommendations
ChristopherWoodward16
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokes
Gagan Bajpai
 
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022
ArangoDB Database
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
Milind Bhandarkar
 
Data Science as Scale
Data Science as ScaleData Science as Scale
Data Science as Scale
Conor B. Murphy
 
Thesis presentation
Thesis presentationThesis presentation
Thesis presentation
Concordia university
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language Processing
Sujit Pal
 
[ML]-Unsupervised-learning_Unit2.ppt.pdf
[ML]-Unsupervised-learning_Unit2.ppt.pdf[ML]-Unsupervised-learning_Unit2.ppt.pdf
[ML]-Unsupervised-learning_Unit2.ppt.pdf
4NM20IS025BHUSHANNAY
 

Similar to Clustering output of Apache Nutch using Apache Spark (20)

How to get started in Big Data for master's students
How to get started in Big Data for master's studentsHow to get started in Big Data for master's students
How to get started in Big Data for master's students
 
OpenHPI - Parallel Programming Concepts - Week 6
OpenHPI - Parallel Programming Concepts - Week 6OpenHPI - Parallel Programming Concepts - Week 6
OpenHPI - Parallel Programming Concepts - Week 6
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Brett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine - Graph Databases and Neo4jBrett Ragozzine - Graph Databases and Neo4j
Brett Ragozzine - Graph Databases and Neo4j
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
 
Cassandra background-and-architecture
Cassandra background-and-architectureCassandra background-and-architecture
Cassandra background-and-architecture
 
Distributed Decision Tree Induction
Distributed Decision Tree InductionDistributed Decision Tree Induction
Distributed Decision Tree Induction
 
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
Document Clustering using LDA | Haridas Narayanaswamy [Pramati]
 
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
 
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
A Kaggle Talk
A Kaggle TalkA Kaggle Talk
A Kaggle Talk
 
Machine Learning + Graph Databases for Better Recommendations
Machine Learning + Graph Databases for Better RecommendationsMachine Learning + Graph Databases for Better Recommendations
Machine Learning + Graph Databases for Better Recommendations
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokes
 
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022
Machine Learning + Graph Databases for Better Recommendations V2 08/20/2022
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
Data Science as Scale
Data Science as ScaleData Science as Scale
Data Science as Scale
 
Thesis presentation
Thesis presentationThesis presentation
Thesis presentation
 
Graph Techniques for Natural Language Processing
Graph Techniques for Natural Language ProcessingGraph Techniques for Natural Language Processing
Graph Techniques for Natural Language Processing
 
[ML]-Unsupervised-learning_Unit2.ppt.pdf
[ML]-Unsupervised-learning_Unit2.ppt.pdf[ML]-Unsupervised-learning_Unit2.ppt.pdf
[ML]-Unsupervised-learning_Unit2.ppt.pdf
 

More from Thamme Gowda

Thamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slidesThamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slides
Thamme Gowda
 
Macro average: rare types are important too
Macro average: rare types are important tooMacro average: rare types are important too
Macro average: rare types are important too
Thamme Gowda
 
500 languages to English Machine Translation Model
500 languages to English Machine Translation Model500 languages to English Machine Translation Model
500 languages to English Machine Translation Model
Thamme Gowda
 
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Thamme Gowda
 
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRGData Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Thamme Gowda
 
Sparkler at spark summit east 2017
Sparkler at spark summit east 2017Sparkler at spark summit east 2017
Sparkler at spark summit east 2017
Thamme Gowda
 
Thamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL InternshipThamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda
 

More from Thamme Gowda (7)

Thamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slidesThamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slides
 
Macro average: rare types are important too
Macro average: rare types are important tooMacro average: rare types are important too
Macro average: rare types are important too
 
500 languages to English Machine Translation Model
500 languages to English Machine Translation Model500 languages to English Machine Translation Model
500 languages to English Machine Translation Model
 
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
 
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRGData Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
 
Sparkler at spark summit east 2017
Sparkler at spark summit east 2017Sparkler at spark summit east 2017
Sparkler at spark summit east 2017
 
Thamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL InternshipThamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL Internship
 

Recently uploaded

HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 

Recently uploaded (20)

HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 

Clustering output of Apache Nutch using Apache Spark

  • 1. Clustering the output of Apache Nutch using Apache Spark. Thamme Gowda N., Dr. Chris Mattmann. May 12, 2016, Vancouver, Canada
  • 2. About ● ThammeGowda Narayanaswamy - TG in short - @thammegowda ○ Contributor to Apache Tika and Apache Nutch ○ Now - a grad student @ University of Southern California ○ Past - Technical Co-Founder @ Datoin - http://datoin.com ● Dr. Chris Mattmann @chrismattmann ○ Adj. Prof. and the director of IRDS group @ University of Southern California, Los Angeles ○ Director @ Apache Software Foundation ○ Chief Architect, NASA JPL
  • 3. Overview ● Problem Statement ● Clustering - a solution ● Structure and Style Similarity ● Shared Near Neighbor Clustering ● Scaling it up using Spark’s Distributed Matrices and GraphX ● A demo
  • 4. Audience ● Those who crawl the web ● Those who extract data from the web ● Those who filter web pages ● Anyone who would like to know about - ○ web page structure and style similarity ○ shared near neighbor clustering
  • 5. Problem Statement ● Scraping data from online marketplaces ● Start with homepage → categories → listing pages → actual stuff (detail page)
  • 6. Sample set of web pages (credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov)
  • 7. Sample set of web pages (same credits). Page labels shown: USELESS
  • 8. Sample set of web pages (same credits). Page labels shown: USELESS; REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS
  • 9. Sample set of web pages (same credits). Page labels shown: USELESS; REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS; USEFUL FOR ANALYSIS
  • 10. Question: How do we solve this? Answer: Cluster the web pages
  • 11. Why Cluster? ● Separate the interesting web pages ○ Drop uninteresting/noisy web pages ○ Categorical treatment of clusters ● Extract structured data using XPath ○ Automated extraction using alignment
  • 12. Goal ● Group web pages that are similar ● Similar in terms of ○ CSS Styles ○ DOM Structure ● Toolkit for experimentation with various thresholds ○ % of similarity in style and/or structure ○ Nice visualizations
  • 13. How do we cluster? ● Based on similarity between pages ● Semantic similarity ○ Meaning of the web pages ● Syntactic similarity ○ Web page structure, CSS styles ● This session focuses on the syntactic aspect
  • 14. Structural similarity ● Web pages are built with HTML ● HTML Doc → DOM tree ● A DOM tree is a labeled ordered tree ● Structural similarity using tree edit distance (TED) ● (Figure: example DOM tree with nodes HTML, HEAD, BODY, TITLE, DIV, P)
  • 15. (Minimum) Tree Edit Distance ● An edit distance measure similar to the one for strings, but on hierarchical data instead of sequences ● The number of editing operations required to transform one tree into another ● Three basic editing operations: INSERT, REMOVE and REPLACE ● A useful measure to quantify how similar (or dissimilar) two trees are
  • 16. Example: Tree Edit Distance* ● Edit operations ● Normalized distance * Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
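The tree edit distance described in the two slides above can be sketched in a few lines of Python. This is a naive memoized recursion over ordered forests, not the optimized Zhang-Shasha algorithm the talk cites, and it is only practical for small trees. The `tree` helper, the unit cost for every operation, and the normalization by the combined node count are assumptions made for illustration; the slide does not state which normalization the authors used.

```python
from functools import lru_cache


def tree(label, *children):
    """Hypothetical helper: a tree is (label, tuple_of_child_trees)."""
    return (label, tuple(children))


def size(forest):
    """Number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests with unit costs."""
    if not f and not g:
        return 0
    if not f:
        return size(g)          # insert every node of g
    if not g:
        return size(f)          # remove every node of f
    v_label, v_children = f[-1]  # rightmost root of f
    w_label, w_children = g[-1]  # rightmost root of g
    # REMOVE v: its children take its place in the forest.
    remove = forest_distance(f[:-1] + v_children, g) + 1
    # INSERT w: symmetric to removal.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # REPLACE v with w (free if labels match), then align the rest.
    replace = (forest_distance(v_children, w_children)
               + forest_distance(f[:-1], g[:-1])
               + (0 if v_label == w_label else 1))
    return min(remove, insert, replace)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def normalized_distance(t1, t2):
    # Assumed normalization: divide by the total node count, giving [0, 1].
    return tree_edit_distance(t1, t2) / (size((t1,)) + size((t2,)))


# Two small, made-up DOM trees that differ by one <p> element.
page_a = tree("html", tree("head", tree("title")),
              tree("body", tree("div"), tree("p")))
page_b = tree("html", tree("head", tree("title")),
              tree("body", tree("div")))
print(tree_edit_distance(page_a, page_b))
```

Representing trees as nested tuples keeps them hashable, which is what lets `lru_cache` memoize the sub-forest comparisons.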
  • 17. Style Similarity ● Have you noticed? ○ Similar web pages have similar CSS styles ● XPath: "//*[@class]/@class" ● Simple measure - ○ Jaccard Similarity on CSS class names
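A minimal sketch of the style measure, assuming the Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. Instead of the XPath query from the slide, it collects `class` attributes with Python's standard `html.parser`, which is a substitution made to keep the example dependency-free. The names `ClassCollector`, `css_classes` and the two sample pages are hypothetical.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in a class attribute."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


page_a = '<div class="item price"><p class="title">x</p></div>'
page_b = '<div class="item"><p class="title desc">y</p></div>'
print(jaccard_similarity(css_classes(page_a), css_classes(page_b)))
```

The two sample pages share `item` and `title` out of four distinct class names, so the printed similarity is 0.5.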
  • 18. Web pages consist of: ● HTML ✓ ● CSS ✓ ● JavaScript ×
  • 19. Aggregating the Style and Structure ● StructuralSimilarity: Normalized Tree Edit Distance ● StyleSimilarity: Jaccard Distance ● Combine on a linear scale ○ Aggregated = k · Structural + (1 - k) · Style
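The linear combination on this slide could look like the following sketch. It assumes both inputs are already similarities in [0, 1] (a normalized distance d would first be converted, for example as 1 - d) and that the weight k lies in [0, 1]; neither assumption is stated on the slide, and the function name is hypothetical.

```python
def aggregate_similarity(structural, style, k=0.5):
    """Weighted mix: k * structural + (1 - k) * style."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be in [0, 1]")
    return k * structural + (1.0 - k) * style


# k = 1.0 uses structure only, k = 0.0 uses style only.
print(aggregate_similarity(structural=0.8, style=0.4, k=0.5))
```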
  • 21. Implementation ● Read Nutch’s Segments ○ sparkContext.sequenceFile(...) ● Filter web pages ○ Robust content type detection -- Tika ● Structural Similarity ○ HTML to DOM Tree -- NekoHTML ○ Tree Edit Distance -- Zhang and Shasha’s algorithm
  • 22. Implementation … ● Style Similarity ○ Query CSS class names using XPath ● Similarity Matrix ○ rdd.cartesian(rdd) to get n × n cells ○ Spark’s Distributed (Coordinate) Matrix ● Persist the matrix for later experimentation with multiple thresholds
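To illustrate the similarity-matrix step without a Spark cluster, the sketch below builds the same n × n set of (row, column, value) entries in plain Python. `itertools.product` stands in for the Cartesian product of the page collection with itself, and the list of triples mirrors the coordinate layout of a distributed matrix. The function names, the sample class-name sets and the 0.3 threshold are hypothetical.

```python
from itertools import product


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_entries(items, similarity):
    """All n x n cells as (i, j, value) triples, in row-major order."""
    indexed = list(enumerate(items))
    return [(i, j, similarity(a, b))
            for (i, a), (j, b) in product(indexed, repeat=2)]


def above_threshold(entries, threshold):
    """Keep off-diagonal cells whose similarity reaches the threshold."""
    return [(i, j, s) for i, j, s in entries if i != j and s >= threshold]


# Made-up CSS class sets for three pages.
pages = [{"item", "price"}, {"item", "title"}, {"nav"}]
entries = similarity_entries(pages, jaccard)
print(above_threshold(entries, 0.3))
```

Because the full set of entries is computed once and kept, different thresholds can be tried afterwards without recomputing any pairwise similarity, which is the point of persisting the matrix.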
  • 23. Clustering ● Shared Near Neighbor Clustering ○ Jarvis & Patrick, 1973 ● With improvements ○ Graph based Implementation ■ Spark GraphX for the win! * Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.
  • 24. What’s good about this algorithm? ● What’s the difficulty with the most popular k-means? ○ Prior knowledge of the number of clusters? ○ Mean/Average of documents in a cluster? ■ Average of DOM Trees? ■ Average of CSS styles? ○ Circular/Spherical/Globular shapes? ● Shared Near Neighbor Clustering ○ Similarity matrix - pluggable similarity measures - generic ○ Thresholds - numbers, percent of match
  • 25. Shared Near Neighbor Algorithm “If two data points share a threshold number of neighbors, then they must belong to the same cluster”
  • 26. Clustering Implementation ● Similarity Matrix to Graph ○ Clusters as nodes, similarity measure as edges ● Check for similar neighbors ○ Filter on threshold and Merge ■ Immutable! - new graph for next iteration ○ Repeat
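The shared near neighbor rule can be sketched on a small in-memory similarity matrix as follows. This is a simplified single-pass version using union-find, not the iterative GraphX implementation from the talk. It assumes that a point's neighbors are all points whose similarity reaches `sim_threshold` (including the point itself), and that two neighboring points are merged when they share at least `min_shared` neighbors; both parameter names and the sample matrix are hypothetical.

```python
def snn_clusters(sim, sim_threshold, min_shared):
    """Cluster indices 0..n-1 of a square similarity matrix."""
    n = len(sim)
    # Neighbor set of each point: everything similar enough to it.
    neighbors = [{j for j in range(n) if sim[i][j] >= sim_threshold}
                 for i in range(n)]

    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbors[i] & neighbors[j])
            # Merge only linked points that share enough neighbors.
            if j in neighbors[i] and shared >= min_shared:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up matrix: pages 0-2 resemble each other, as do pages 3-4.
sim = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]
print(snn_clusters(sim, sim_threshold=0.8, min_shared=2))
```

Unlike k-means, nothing here needs a cluster count or an average of pages; only the similarity matrix and the two thresholds are required.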
  • 27. Shared Near Neighbor Clustering on Apache Spark GraphX
  • 28. Challenges ● Tree Edit Distance is very expensive
  • 29. What’s ahead on the road? ● Integrate into Apache Nutch ● Auto Extraction ○ Unsupervised learning on the structure of pages to scrape the actual data of the web page ● Faster Tree Edit Distance ○ Maybe with approximation techniques
  • 31. Summary ● Example Scenario ● Similarity measures ● Clustering as a solution ● Demo
  • 32. Acknowledgements ● Dr. Chris Mattmann ○ My mentor ○ Professor, Director at IRDS @ USC - http://irds.usc.edu ○ Director, Apache Software Foundation ● DARPA Memex project
  • 33. Thank You! ● Source Code ● Tutorial ● Follow up ○ Thamme Gowda - @thammegowda ○ Chris Mattmann - @chrismattmann