July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA
Thamme Gowda
@thammegowda
Dr. Chris Mattmann
@chrismattmann
CLUSTERING WEB PAGES BASED ON
STRUCTURE AND STYLE SIMILARITY
Information Retrieval
and Data Science
OUTLINE
• Problem Statement
• Method Overview
• Steps
• Tree Edit Distance
• Style Similarity
• Shared Near Neighbor Clustering
• Evaluation
• Challenges
PROBLEM STATEMENT
• Scraping data from online marketplaces
• Start with the homepage → categories → listings → detail pages (the actual content)
SAMPLE WEB PAGES
Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
[Figure: eight sample web pages, numbered 1-8, annotated step by step over four slides. Two are marked USELESS, three are marked "CRAWLER: YES, ANALYSIS: NO", and three are marked USEFUL.]
METHOD OVERVIEW
CLUSTERING
• “task of grouping a set of objects in such a way that objects
in the same group are more similar (in some sense or the
other) to each other than to those in the other groups”
– Wikipedia
• There are many ways to achieve this.
HOW DO WE CLUSTER
SIMILARITY CHECK
• Based on similarity between pages
• Semantic similarity
  • Meaning of the web pages (keywords, topics, …)
• Syntactic similarity
  • Web page structure, CSS styles
• This presentation focuses on the syntactic aspect
  • HTML ✓
  • CSS ✓
  • JavaScript ×
METHOD : INPUT
WEB PAGES FROM CRAWLER (LIKE APACHE NUTCH)
METHOD : STEP #1
WEB PAGES FROM CRAWLER (LIKE APACHE NUTCH) → STRUCTURAL SIMILARITY
STRUCTURAL SIMILARITY
• Web pages are built with HTML
• HTML document → DOM tree
  • A labeled, ordered tree
• Structural similarity using tree edit distance (TED)
[Figure: example DOM tree with levels HTML / HEAD, BODY / TITLE, DIV, P]
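The HTML document → DOM tree step can be sketched as follows. This is a minimal illustration using only Python's standard `html.parser`, not the parser used in the authors' code; the names `Node`, `DomTreeBuilder`, `parse_dom`, and `to_nested` are hypothetical. It keeps only element tags as labels and preserves child order, which is what a labeled ordered tree needs.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the labeled ordered tree: a tag name plus ordered children."""

    def __init__(self, label):
        self.label = label
        self.children = []


class DomTreeBuilder(HTMLParser):
    """Builds a tag-only tree; text and attributes are ignored in this sketch."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break


def parse_dom(html):
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def to_nested(node):
    """Convert a Node into a (label, [children]) structure for easy inspection."""
    return (node.label, [to_nested(child) for child in node.children])


tree = parse_dom(
    "<html><head><title>Ad</title></head>"
    "<body><div></div><p>text</p></body></html>"
)
print(to_nested(tree))
```

Real-world pages are often malformed, so a production pipeline would likely rely on a more forgiving HTML parser than this sketch.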
MINIMUM TREE EDIT DISTANCE
• An edit distance measure like the one for strings, but on hierarchical data instead of sequences
• Number of editing operations required to transform one tree into another
• Three basic editing operations: INSERT, REMOVE and REPLACE
• A useful measure to quantify how similar (or dissimilar) two trees are
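To make the definition concrete, here is a small sketch of ordered tree edit distance with unit costs for INSERT, REMOVE and REPLACE. It uses the plain memoized forest recursion rather than the Zhang-Shasha algorithm, so it is only suitable for tiny trees. The tuple-based tree encoding and the function names are assumptions made for this example. The slides mention a normalized distance without giving the formula; dividing by the combined node count, as `structural_similarity` does, is one possible choice and not necessarily the authors'.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples are hashable, so results can be cached.


def size(forest):
    """Total number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Minimum number of unit-cost edits turning forest f into forest g."""
    if not f:
        return size(g)  # insert every node of g
    if not g:
        return size(f)  # remove every node of f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # REMOVE the rightmost root of f; its children take its place.
    remove = forest_distance(f[:-1] + v_children, g) + 1
    # INSERT the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, REPLACE-ing the label if it differs.
    replace = (forest_distance(v_children, w_children)
               + forest_distance(f[:-1], g[:-1])
               + (0 if v_label == w_label else 1))
    return min(remove, insert, replace)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def structural_similarity(t1, t2):
    """Assumed normalization: 1 - distance / (|T1| + |T2|), always in [0, 1]."""
    total = size((t1,)) + size((t2,))
    return 1.0 - tree_edit_distance(t1, t2) / total


page_a = ("html", (("head", (("title", ()),)),
                   ("body", (("div", ()), ("p", ())))))
page_b = ("html", (("head", (("title", ()),)),
                   ("body", (("div", ()),))))        # same page without <p>
page_c = ("html", (("head", (("title", ()),)),
                   ("body", (("div", ()), ("span", ())))))  # <p> became <span>

print(tree_edit_distance(page_a, page_b))  # one REMOVE
print(tree_edit_distance(page_a, page_c))  # one REPLACE
print(structural_similarity(page_a, page_b))
```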
MINIMUM TREE EDIT DISTANCE*
● Edit operations
● Normalized distance
[Figure: four numbered panels, 1 to 4]
* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
METHOD : STEP #2
WEB PAGES FROM CRAWLER (LIKE APACHE NUTCH) → STYLE SIMILARITY
STYLE SIMILARITY
• Similar web pages have similar CSS styles
• XPath: "//*[@class]/@class"
• Simple measure: Jaccard similarity on CSS class names
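A rough sketch of this measure is shown below. Instead of running the XPath query, it collects `class` attribute values with Python's standard `html.parser`, which should yield the same set of class names for well-formed pages. The helper names (`ClassCollector`, `css_classes`, `jaccard`, `style_similarity`) and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in a class attribute."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def style_similarity(html_a, html_b):
    return jaccard(css_classes(html_a), css_classes(html_b))


page_a = '<div class="ad title"><span class="price">$10</span></div>'
page_b = '<div class="ad title"><p class="desc">details</p></div>'
print(style_similarity(page_a, page_b))  # 2 shared names out of 4 distinct
```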
METHOD : STEP #3
AGGREGATED = k · STRUCTURAL + (1 - k) · STYLE
[Diagram: the STRUCTURAL and STYLE similarities are combined into one aggregated similarity]
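The weighted combination above amounts to a one-line function. In this sketch the function name is hypothetical, and the default `k = 0.5` simply mirrors the 50/50 weighting reported in the evaluation.

```python
def aggregated_similarity(structural, style, k=0.5):
    """Combine two similarities in [0, 1]; k is the weight of the structural part."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1.0 - k) * style


# Structurally identical pages that share half of their CSS class names.
print(aggregated_similarity(1.0, 0.5))
```

Setting `k = 1` ignores style entirely and `k = 0` ignores structure, so `k` controls how much each signal contributes to the similarity matrix.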
METHOD : STEP #4
SIMILARITY MATRIX → CLUSTERING (SHARED NEAR NEIGHBOR) → CLUSTERS
SHARED NEAR NEIGHBOR (SNN) ALGORITHM
“If two data points share a threshold number of neighbors, then they must belong to the same cluster” *
* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.
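One way to read the quoted rule is sketched below. It is a simplified variant, not the original Jarvis-Patrick formulation (which works on k-nearest-neighbor lists) and not necessarily the authors' implementation. The assumptions are: a page's neighbors are all pages whose similarity meets `sim_threshold`; two neighboring pages are linked when the overlap of their neighbor sets, measured as a Jaccard ratio, meets `share_threshold`; and clusters are the connected groups of linked pages.

```python
def snn_clusters(sim, sim_threshold=0.9, share_threshold=0.9):
    """Cluster items from a symmetric similarity matrix (list of lists)."""
    n = len(sim)
    # Neighbors of i: every item at least sim_threshold similar (includes i itself).
    neighbors = [{j for j in range(n) if sim[i][j] >= sim_threshold}
                 for i in range(n)]

    parent = list(range(n))  # union-find forest over the items

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if j not in neighbors[i]:
                continue
            shared = len(neighbors[i] & neighbors[j])
            total = len(neighbors[i] | neighbors[j])
            # Link the pair when enough of their neighbors are shared.
            if shared / total >= share_threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical aggregated similarities for five pages: 0-2 look alike, 3-4 look alike.
sim = [
    [1.00, 0.95, 0.95, 0.20, 0.20],
    [0.95, 1.00, 0.95, 0.20, 0.20],
    [0.95, 0.95, 1.00, 0.20, 0.20],
    [0.20, 0.20, 0.20, 1.00, 0.95],
    [0.20, 0.20, 0.20, 0.95, 1.00],
]
print(snn_clusters(sim))
```

Note that the number of clusters is never specified; it falls out of the two thresholds.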
WHAT’S GOOD ABOUT SNN ALGORITHM
• Guessing k in k-means is hard
  • A more meaningful request: “Make clusters of 90% similarity” instead of “Make 10 clusters”
• k-means needs the mean / average of the documents in a cluster
  • Average of DOM trees?
  • Average of CSS styles?
• k-means favors circular / spherical / globular cluster shapes
METHOD : LAST STEP*
CLUSTERS → LABELING → CATEGORIES / USABLE CLUSTERS
* HUMAN INTERVENTION - THIS STEP REQUIRES DOMAIN EXPERTISE
SOME APPLICATIONS?
• Separate the interesting web pages
• Drop uninteresting/noisy web pages
• Categorical treatment of clusters
• Extract structured data using XPath
• Automated extraction using alignment
WORKFLOW: PART #1
[Figure: workflow diagram, part 1; image not recoverable]
WORKFLOW: PART #2
[Figure: workflow diagram, part 2; image not recoverable]
EVALUATION
DATASET:
1310 web pages from http://armslist.com
• 987 ad detail pages
• 311 ad listing pages
• 12 others (index, contact, FAQs, etc.)
PARAMETERS:
• 50% weight for CSS style, 50% weight for HTML structure
• Series of experiments at various thresholds: 85%, 90%, 95%
EVALUATION
PARAMETERS: SIMILARITY = 90%, SHARED NEIGHBORS = 90%
[Figure: results for these parameters; image not recoverable]
EVALUATION
PARAMETERS: SIMILARITY = 95%, SHARED NEIGHBORS = 95%
[Figure: results for these parameters; image not recoverable]
EVALUATION
PARAMETERS: SIMILARITY = 85%, SHARED NEIGHBORS = 85%
[Figure: results for these parameters; image not recoverable]
CHALLENGES
• TED is very expensive
• Zhang-Shasha’s TED runs in O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)})
• That’s O(n^4) in the worst case
• Approx. 1000 HTML tags per page
• That’s on the order of 10^12 operations per pair of pages
[Figure: time complexity versus number of HTML tags]
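The arithmetic behind these estimates can be checked with a few lines. The function names are hypothetical, the depth and leaf counts in the second call are invented for illustration, and comparing every pair of pages is an assumption based on the method filling a full similarity matrix.

```python
def zhang_shasha_bound(size1, depth1, leaves1, size2, depth2, leaves2):
    """Operation count from the Zhang-Shasha bound quoted on the slide."""
    return size1 * size2 * min(depth1, leaves1) * min(depth2, leaves2)


def worst_case_operations(n):
    """Worst case: min(depth, leaves) grows linearly with n, giving n^4."""
    return n ** 4


def page_pairs(num_pages):
    """Number of unordered page pairs in an all-pairs comparison."""
    return num_pages * (num_pages - 1) // 2


print(worst_case_operations(1000))                        # 10^12 for ~1000 tags
print(zhang_shasha_bound(1000, 10, 400, 1000, 10, 400))   # shallow trees: 10^8
print(page_pairs(1310))                                   # pairs in the dataset
```

The second call suggests why shallow DOM trees are much cheaper than the worst case, since the bound depends on the smaller of depth and leaf count rather than on page size alone.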
ACKNOWLEDGMENTS
DARPA MEMEX
* Photo Credits : http://memex.jpl.nasa.gov/
THANK YOU
• Source Code: https://github.com/USCDataScience/autoextractor
• Tutorial: https://git.io/vwS69
• Follow up
  • Thamme Gowda - @thammegowda
  • Chris Mattmann - @chrismattmann

Celonis Busniess Analyst Virtual Internship.pptxCelonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptx
AnujaGaikwad28
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
tanupasswan6
 
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
Grant McAlister
 

Recently uploaded (20)

CMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdf
CMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdfCMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdf
CMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdf
 
High Girls Call Mohali 000XX00000 Provide Best And Top Girl Service And No1 i...
High Girls Call Mohali 000XX00000 Provide Best And Top Girl Service And No1 i...High Girls Call Mohali 000XX00000 Provide Best And Top Girl Service And No1 i...
High Girls Call Mohali 000XX00000 Provide Best And Top Girl Service And No1 i...
 
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdfWhy_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
 
potential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in generalpotential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in general
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
 
Training on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptxTraining on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptx
 
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
 
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
 
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
 
Potential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriatePotential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriate
 
🚂🚘 Premium Girls Call Guwahati 🛵🚡000XX00000 💃 Choose Best And Top Girl Servi...
🚂🚘 Premium Girls Call Guwahati  🛵🚡000XX00000 💃 Choose Best And Top Girl Servi...🚂🚘 Premium Girls Call Guwahati  🛵🚡000XX00000 💃 Choose Best And Top Girl Servi...
🚂🚘 Premium Girls Call Guwahati 🛵🚡000XX00000 💃 Choose Best And Top Girl Servi...
 
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
 
Cyber Insurance Mathematical Model & Pricing
Cyber Insurance Mathematical Model & PricingCyber Insurance Mathematical Model & Pricing
Cyber Insurance Mathematical Model & Pricing
 
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
 
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
 
High Profile Girls Call Delhi 🛵🚡9711199171 💃 Choose Best And Top Girl Service...
High Profile Girls Call Delhi 🛵🚡9711199171 💃 Choose Best And Top Girl Service...High Profile Girls Call Delhi 🛵🚡9711199171 💃 Choose Best And Top Girl Service...
High Profile Girls Call Delhi 🛵🚡9711199171 💃 Choose Best And Top Girl Service...
 
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
 
Celonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptxCelonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptx
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
 
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
 

IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

  • 1. CLUSTERING WEB PAGES BASED ON STRUCTURE AND STYLE SIMILARITY • Thamme Gowda (@thammegowda) • Dr. Chris Mattmann (@chrismattmann) • July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA • Information Retrieval and Data Science
  • 2. OUTLINE • Problem Statement • Method Overview • Steps • Tree Edit Distance • Style Similarity • Shared Near Neighbor Clustering • Evaluation • Challenges
  • 3. PROBLEM STATEMENT • Scraping data from online marketplaces • Start with homepage → categories → listing → actual stuff (detail page)
  • 4. SAMPLE WEB PAGES [figure: eight sample pages, numbered 1-8] Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
  • 5. SAMPLE WEB PAGES [figure from slide 4, with two pages marked USELESS]
  • 6. SAMPLE WEB PAGES [figure from slide 4, with two pages marked USELESS and three marked CRAWLER: YES, ANALYSIS: NO]
  • 7. SAMPLE WEB PAGES [figure from slide 4, with two pages marked USELESS, three marked CRAWLER: YES, ANALYSIS: NO, and three marked USEFUL]
  • 8. METHOD OVERVIEW: CLUSTERING
  • 9. CLUSTERING • “task of grouping a set of objects in such a way that objects in the same group are more similar (in some sense or the other) to each other than to those in the other groups” – Wikipedia • There are many ways to achieve this.
  • 10. HOW DO WE CLUSTER • Based on similarity between pages • Semantic similarity: meaning of the web pages (keywords, topics, …) • Syntactic similarity: web page structure, CSS styles • This presentation focuses on the syntactic aspect
  • 11. SIMILARITY CHECK • HTML ✓ • CSS ✓ • JavaScript ×
  • 12. METHOD: INPUT • Web pages from a crawler such as Apache Nutch
  • 13. METHOD: STEP #1 • Web pages from a crawler such as Apache Nutch → STRUCTURAL SIMILARITY
  • 14. STRUCTURAL SIMILARITY • Web pages are built with HTML • HTML document → DOM tree, a labeled ordered tree • Structural similarity using tree edit distance (TED) [figure: example DOM tree with nodes HTML, HEAD, BODY, TITLE, DIV, P]
  • 15. MINIMUM TREE EDIT DISTANCE • An edit distance like the one for strings, but defined on hierarchical data instead of sequences • The number of editing operations required to transform one tree into another • Three basic editing operations: INSERT, REMOVE and REPLACE • A useful measure to quantify how similar (or dissimilar) two trees are
  • 16. MINIMUM TREE EDIT DISTANCE* • Edit operations • Normalized distance [figure: four numbered panels] * Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
  • 17. METHOD: STEP #2 • Web pages from a crawler such as Apache Nutch → STYLE SIMILARITY
  • 18. STYLE SIMILARITY • Similar web pages have similar CSS styles • XPath: "//*[@class]/@class" • Simple measure: Jaccard similarity on CSS class names
  • 19. METHOD: STEP #3 • AGGREGATED = k × STRUCTURAL + (1 - k) × STYLE
  • 20. METHOD: STEP #4 • SIMILARITY MATRIX → CLUSTERING (SHARED NEAR NEIGHBOR) → CLUSTERS
  • 21. SHARED NEAR NEIGHBOR (SNN) ALGORITHM • “If two data points share a threshold number of neighbors, then they must belong to the same cluster” * [figure: web pages as data points] * Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.
  • 22. WHAT’S GOOD ABOUT SNN ALGORITHM • Guessing k in k-means is hard • Meaningful question: “Make clusters of 90% similarity” instead of “Make 10 clusters” • Mean / average of documents in a cluster? • Average of DOM trees? • Average of CSS styles? • Circular / spherical / globular shapes?
  • 23. METHOD: LAST STEP* • LABELING CLUSTERS → CATEGORIES / USABLE CLUSTERS
  • 24. METHOD: LAST STEP* • LABELING CLUSTERS → CATEGORIES / USABLE CLUSTERS * Human intervention: this step requires domain expertise
  • 25. SOME APPLICATIONS? • Separate the interesting web pages • Drop uninteresting/noisy web pages • Categorical treatment of clusters • Extract structured data using XPath • Automated extraction using alignment
  • 26. WORKFLOW: PART #1
  • 27. WORKFLOW: PART #2
  • 28. EVALUATION • DATASET: 1310 web pages from http://armslist.com • 987 ad detail pages • 311 ad listing pages • 12 others (index, contact, FAQs, etc.) • PARAMETERS: 50% weight for CSS style, 50% weight for HTML structure • Series of experiments with various thresholds: 85%, 90%, 95%
  • 29. EVALUATION • PARAMETERS: SIMILARITY = 90%, SHARED NEIGHBORS = 90%
  • 30. EVALUATION • PARAMETERS: SIMILARITY = 95%, SHARED NEIGHBORS = 95%
  • 31. EVALUATION • PARAMETERS: SIMILARITY = 85%, SHARED NEIGHBORS = 85%
  • 32. CHALLENGES • TED is very expensive • Zhang-Shasha’s TED: O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)}) • That’s O(n^4) in the worst case • Approx. 1000 HTML tags • That’s on the order of 10^12 operations [figure: time complexity vs. number of HTML tags]
  • 33. ACKNOWLEDGMENTS • DARPA MEMEX * Photo credits: http://memex.jpl.nasa.gov/
  • 34. THANK YOU • Source code: https://github.com/USCDataScience/autoextractor • Tutorial: https://git.io/vwS69 • Follow up: Thamme Gowda (@thammegowda), Chris Mattmann (@chrismattmann)
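
Steps #2 to #4 of the method (style similarity, aggregation, shared near neighbor clustering) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the autoextractor implementation: class names are collected with the standard-library `html.parser` rather than the XPath from slide 18, a page's neighbors are assumed to be the pages whose similarity meets `sim_threshold`, the "shared neighbors" fraction is assumed to be intersection over union of the two neighbor sets, and clusters are formed as connected components of the resulting links. All function and parameter names are hypothetical.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in a class attribute."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def aggregate(structural, style, k=0.5):
    """AGGREGATED = k * STRUCTURAL + (1 - k) * STYLE (slide 19)."""
    return k * structural + (1 - k) * style


def snn_clusters(sim, sim_threshold=0.9, shared_threshold=0.9):
    """Shared near neighbor clustering over a square similarity matrix.

    Assumption: two pages are linked when they are neighbors and the
    overlap of their neighbor sets reaches shared_threshold.
    """
    n = len(sim)
    neighbors = [
        {j for j in range(n) if sim[i][j] >= sim_threshold} for i in range(n)
    ]
    parent = list(range(n))  # union-find forest, one root per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if j not in neighbors[i]:
                continue
            shared = len(neighbors[i] & neighbors[j])
            total = len(neighbors[i] | neighbors[j])
            if shared / total >= shared_threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Because the clustering only needs a pairwise similarity matrix, it never has to compute a mean page or assume a cluster shape, which is the point slide 22 makes against k-means.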

Editor's Notes

  1. Base version, followed by three variants that illustrate the three different edit operations. Count all of these operations.
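
The counting described in this note can be illustrated with a small Python sketch. It assumes unit cost for each INSERT, REMOVE and REPLACE and uses the plain recursive forest formulation with memoization, not the optimized Zhang-Shasha algorithm cited on slide 16, so it is only practical for small trees. The tuple encoding of trees, the function names, and the normalization by total node count are assumptions made for illustration; the slides do not give the normalization formula.

```python
from functools import lru_cache


def node(label, *children):
    """A tree is (label, children), with children a tuple of trees."""
    return (label, tuple(children))


def size(forest):
    """Number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Minimum number of unit-cost edits turning forest f1 into f2."""
    if not f1:
        return size(f2)  # insert everything in f2
    if not f2:
        return size(f1)  # remove everything in f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # REMOVE the rightmost root of f1; its children move up a level.
    remove = forest_distance(f1[:-1] + kids1, f2) + 1
    # INSERT the rightmost root of f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, REPLACING the label if it differs.
    replace = (
        forest_distance(f1[:-1], f2[:-1])
        + forest_distance(kids1, kids2)
        + (0 if label1 == label2 else 1)
    )
    return min(remove, insert, replace)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def structural_similarity(t1, t2):
    """Assumed normalization: 1 - distance / (nodes in t1 + nodes in t2)."""
    return 1 - tree_edit_distance(t1, t2) / (size((t1,)) + size((t2,)))


# Base tree (the DOM tree from slide 14) and one variant per operation.
BASE = node("html", node("head", node("title")),
            node("body", node("div"), node("p")))
INSERTED = node("html", node("head", node("title")),
                node("body", node("div"), node("p"), node("span")))
REMOVED = node("html", node("head", node("title")),
               node("body", node("div")))
REPLACED = node("html", node("head", node("title")),
                node("body", node("div"), node("ul")))

for name, variant in [("insert", INSERTED), ("remove", REMOVED),
                      ("replace", REPLACED)]:
    print(name, tree_edit_distance(BASE, variant))
```

Dividing by the combined node count keeps the score in [0, 1], since removing every node of one tree and inserting every node of the other is always a valid edit script.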