July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA
Thamme Gowda
@thammegowda
Dr. Chris Mattmann
@chrismattmann
CLUSTERING WEB PAGES BASED ON
STRUCTURE AND STYLE SIMILARITY
Information Retrieval
and Data Science
OUTLINE
• Problem Statement
• Method Overview
• Steps
• Tree Edit Distance
• Style Similarity
• Shared Near Neighbor Clustering
• Evaluation
• Challenges
PROBLEM STATEMENT
• Scraping data from online marketplaces
• Start with the homepage → categories → listings → actual content (detail pages)
SAMPLE WEB PAGES
Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
[Figure: eight sample web pages, numbered 1 to 8. Labels shown across the build-up slides: USELESS (2 pages); CRAWLER: YES, ANALYSIS: NO (3 pages); USEFUL (3 pages).]
METHOD OVERVIEW
CLUSTERING
• “task of grouping a set of objects in such a way that objects
in the same group are more similar (in some sense or the
other) to each other than to those in the other groups”
– Wikipedia
• There are many ways to achieve this.
HOW DO WE CLUSTER
• Based on similarity between pages
• Semantic similarity
• meaning of the web pages (keywords, topics,…)
• Syntactic similarity
• Web page structure, CSS styles
• This presentation focuses on the syntactic aspect
• HTML ✓
• CSS ✓
• JavaScript ×
SIMILARITY CHECK
METHOD : INPUT
WEB PAGES FROM CRAWLER
LIKE APACHE NUTCH
METHOD : STEP #1
WEB PAGES FROM CRAWLER
LIKE APACHE NUTCH
STRUCTURAL SIMILARITY
• Web pages are built with HTML
• HTML document → DOM tree
• The DOM tree is a labeled, ordered tree
• Structural similarity is measured using tree edit distance (TED)
[Figure: example DOM tree, level by level: HTML; HEAD, BODY; TITLE, DIV, P]
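The step from an HTML document to a labeled ordered tree could be sketched as follows. This is a minimal illustration using Python's standard `html.parser`, not the parser used by the authors; the names `Node`, `DomTreeBuilder`, `html_to_tree`, and the synthetic `#document` root are assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One DOM element: a label (tag name) and an ordered list of children."""

    def __init__(self, label):
        self.label = label
        self.children = []

    def to_tuple(self):
        # Nested (label, children) tuples are hashable and easy to compare.
        return (self.label, tuple(c.to_tuple() for c in self.children))


class DomTreeBuilder(HTMLParser):
    """Builds a labeled ordered tree of tag names; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")  # hypothetical synthetic root
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break


def html_to_tree(html):
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root.to_tuple()


tree = html_to_tree(
    "<html><head><title>t</title></head><body><div></div><p></p></body></html>"
)
```

The tuple form keeps the child order, which matters because tree edit distance is defined on ordered trees.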
MINIMUM TREE EDIT DISTANCE
• An edit distance measure similar to the one for strings, but on hierarchical data instead of sequences
• The number of editing operations required to transform one tree into another
• Three basic editing operations: INSERT, REMOVE and REPLACE
• A useful measure to quantify how similar (or dissimilar) two trees are
MINIMUM TREE EDIT DISTANCE*
● Edit operations
● Normalized distance
[Figure: four numbered panels, 1 to 4; contents not recoverable]
* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
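To make the edit distance concrete, here is a small sketch with unit costs for INSERT, REMOVE and REPLACE. It uses the plain memoized forest recurrence, not the optimized Zhang-Shasha algorithm, so it is only suitable for small trees. Trees are assumed to be nested `(label, children)` tuples, and the normalization by total node count is an assumption, since the slides do not give the normalization formula.

```python
from functools import lru_cache


def size(forest):
    """Number of nodes in a forest of (label, children) trees."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f1:
        return size(f2)  # INSERT every remaining node
    if not f2:
        return size(f1)  # REMOVE every remaining node
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        # REMOVE the rightmost root of f1; its children take its place.
        forest_distance(f1[:-1] + kids1, f2) + 1,
        # INSERT the rightmost root of f2.
        forest_distance(f1, f2[:-1] + kids2) + 1,
        # Match the two rightmost roots (REPLACE if their labels differ).
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + (label1 != label2),
    )


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def structural_similarity(t1, t2):
    # Assumption: dividing by the total node count keeps the score in [0, 1].
    total = size((t1,)) + size((t2,))
    return 1.0 - tree_edit_distance(t1, t2) / total


page_a = ("html", (("head", (("title", ()),)), ("body", (("div", ()), ("p", ())))))
page_b = ("html", (("head", (("title", ()),)), ("body", (("div", ()),))))
distance = tree_edit_distance(page_a, page_b)  # one REMOVE (the "p" node)
```

Here `page_a` and `page_b` differ by a single `p` element, so one REMOVE operation transforms one into the other.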
METHOD : STEP #2
WEB PAGES FROM CRAWLER
LIKE APACHE NUTCH
STYLE SIMILARITY
• Similar web pages have similar CSS styles
• XPath for collecting class names: //*[@class]/@class
• Simple measure: Jaccard similarity on CSS class names
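The style measure above can be sketched in a few lines. The standard library has no full XPath engine for HTML, so this example collects `class` attributes with `html.parser` as a stand-in for the XPath query; `ClassCollector`, `css_classes`, and `jaccard` are names chosen for this illustration, and the value returned for two pages without any classes is an assumption.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name used in a page."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0  # assumption: two pages without classes count as identical in style
    return len(a & b) / len(a | b)


page_a = '<div class="ad title"><span class="price">1</span></div>'
page_b = '<div class="ad"><span class="price">2</span><p class="seller">x</p></div>'
style_sim = jaccard(css_classes(page_a), css_classes(page_b))
```

The two sample pages share `ad` and `price` out of four distinct class names, so the score is 2/4.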
METHOD : STEP #3
AGGREGATED = k × STRUCTURAL + (1 − k) × STYLE
Inputs: STRUCTURAL similarity, STYLE similarity
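A minimal sketch of the weighted aggregation, applied entry by entry to two pairwise similarity matrices. The function name `aggregate` and the plain list-of-lists representation are assumptions for this example; the sample values are made up for illustration.

```python
def aggregate(structural, style, k=0.5):
    """Combine two n x n similarity matrices as k*structural + (1-k)*style."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    n = len(structural)
    return [
        [k * structural[i][j] + (1 - k) * style[i][j] for j in range(n)]
        for i in range(n)
    ]


# Illustrative values only: two pages, equal weight for structure and style.
structural = [[1.0, 0.8], [0.8, 1.0]]
style = [[1.0, 0.6], [0.6, 1.0]]
combined = aggregate(structural, style, k=0.5)
```

With `k=0.5` each off-diagonal entry is the plain average of the two scores; moving `k` toward 1 makes structure dominate.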
METHOD : STEP #4
SIMILARITY MATRIX → CLUSTERING (SHARED NEAR NEIGHBOR) → CLUSTERS
SHARED NEAR NEIGHBOR (SNN) ALGORITHM
“If two data points share a threshold number of neighbors, then they must belong to the same cluster” *
[Figure: web pages shown as data points]
* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, 100(11), 1025-1034.
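One way to read the shared near neighbor idea is sketched below. It is a simplified variant under stated assumptions, not the authors' implementation: a page's neighbors are all pages at or above a similarity threshold (the original Jarvis-Patrick formulation uses k nearest neighbors), two neighboring pages are linked when the shared fraction of the smaller neighbor set reaches a second threshold, and clusters are the connected components of those links. The function name and both parameter names are hypothetical.

```python
def snn_clusters(sim, sim_threshold=0.9, share_threshold=0.9):
    """Cluster items from an n x n similarity matrix by shared near neighbors."""
    n = len(sim)
    # Each item counts as its own neighbor, so no neighbor set is empty.
    neighbors = [
        {i} | {j for j in range(n) if sim[i][j] >= sim_threshold}
        for i in range(n)
    ]

    parent = list(range(n))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if j not in neighbors[i]:
                continue  # only near neighbors can be linked
            shared = len(neighbors[i] & neighbors[j])
            smaller = min(len(neighbors[i]), len(neighbors[j]))
            if shared / smaller >= share_threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Illustrative matrix: pages 0 and 1 are alike, pages 2 and 3 are alike.
sim = [
    [1.00, 0.95, 0.10, 0.10],
    [0.95, 1.00, 0.10, 0.10],
    [0.10, 0.10, 1.00, 0.92],
    [0.10, 0.10, 0.92, 1.00],
]
clusters = snn_clusters(sim)
```

No cluster count is supplied; the number of clusters falls out of the two thresholds.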
WHAT’S GOOD ABOUT THE SNN ALGORITHM
• Guessing k in k-means is hard
• A more meaningful request: “Make clusters of 90% similarity” instead of “Make 10 clusters”
• Mean / average of the documents in a cluster?
• Average of DOM trees?
• Average of CSS styles?
• Circular / spherical / globular cluster shapes?
METHOD : LAST STEP*
CLUSTERS → LABELING → CATEGORIES / USABLE CLUSTERS
* HUMAN INTERVENTION - THIS STEP REQUIRES DOMAIN EXPERTISE
SOME APPLICATIONS?
• Separate the interesting web pages
• Drop uninteresting / noisy web pages
• Categorical treatment of clusters
• Extract structured data using XPath
• Automated extraction using alignment
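As an illustration of extracting structured data with XPath once a cluster has been labeled, the sketch below applies one set of per-cluster rules to a page. The field names, class names, and XPath rules are hypothetical. It uses the limited XPath subset in Python's `xml.etree.ElementTree`, which requires well-formed markup; real-world HTML would need a more lenient parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction rules for one cluster of ad detail pages.
RULES = {
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
}


def extract(page_xml, rules):
    """Apply one XPath rule per field; missing fields map to None."""
    root = ET.fromstring(page_xml)
    record = {}
    for field, path in rules.items():
        element = root.find(path)
        has_text = element is not None and element.text
        record[field] = element.text.strip() if has_text else None
    return record


page = (
    "<html><body>"
    "<h1 class='title'>Item A</h1>"
    "<span class='price'> $100 </span>"
    "</body></html>"
)
record = extract(page, RULES)
```

Because pages in one cluster share structure and styles, a single rule set like `RULES` can plausibly be reused across the whole cluster.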
WORKFLOW: PART #1
[Workflow diagram; contents not recoverable]
WORKFLOW: PART #2
[Workflow diagram; contents not recoverable]
EVALUATION
DATASET:
1310 web pages from http://armslist.com
• 987 ad detail pages
• 311 ad listing pages
• 12 others: index, contact, FAQs, etc.
PARAMETERS:
• 50% weight for CSS style, 50% weight for HTML structure
• Series of experiments at various thresholds: 85%, 90%, 95%
PARAMETERS (one result figure per setting; figures not recoverable):
• SIMILARITY = 90%, SHARED NEIGHBORS = 90%
• SIMILARITY = 95%, SHARED NEIGHBORS = 95%
• SIMILARITY = 85%, SHARED NEIGHBORS = 85%
CHALLENGES
• TED is very expensive
• Zhang-Shasha’s TED runs in O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)})
• That’s O(n^4) in the worst case
• Approx. 1000 HTML tags (n ≈ 1000)
• That’s on the order of 10^12 operations
[Figure: time complexity plotted against number of HTML tags]
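The arithmetic behind this estimate can be checked directly. The sketch below bounds each min{depth, leaves} term by the tree size, which gives the n^4 worst case. The pairwise count is an inference rather than a figure from the slides: it assumes every pair among the 1310 evaluation pages is compared once to fill the similarity matrix.

```python
import math


def worst_case_ops(n1, n2):
    """Upper bound for the Zhang-Shasha cost, with min{depth, leaves} <= tree size."""
    return n1 * n2 * n1 * n2


tags_per_page = 1000
ops_per_pair = worst_case_ops(tags_per_page, tags_per_page)

# Assumption: a full similarity matrix needs one comparison per unordered pair.
pages = 1310
pairs = math.comb(pages, 2)
```

Real pages usually have depth far below their tag count, so the bound is pessimistic, but it shows why the structural step dominates the running time.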
ACKNOWLEDGMENTS
DARPA MEMEX
* Photo Credits : http://memex.jpl.nasa.gov/
• Source Code
https://github.com/USCDataScience/autoextractor
• Tutorial
https://git.io/vwS69
• Follow up
• Thamme Gowda - @thammegowda
• Chris Mattmann - @chrismattmann
THANK YOU

From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 

Recently uploaded (20)

Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 

IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

  • 1. CLUSTERING WEB PAGES BASED ON STRUCTURE AND STYLE SIMILARITY. July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA. Thamme Gowda (@thammegowda), Dr. Chris Mattmann (@chrismattmann). Information Retrieval and Data Science
  • 2. OUTLINE • Problem Statement • Method Overview • Steps • Tree Edit Distance • Style Similarity • Shared Near Neighbor Clustering • Evaluation • Challenges
  • 3. PROBLEM STATEMENT • Scraping data from online marketplaces • Start with the homepage → categories → listing → actual content (detail page)
  • 4. SAMPLE WEB PAGES • Figure: eight sample web pages, numbered 1 to 8. Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
  • 5. SAMPLE WEB PAGES • Same figure; two of the pages are labeled USELESS
  • 6. SAMPLE WEB PAGES • Same figure; two pages are labeled USELESS and three are labeled CRAWLER: YES, ANALYSIS: NO
  • 7. SAMPLE WEB PAGES • Same figure; two pages are labeled USELESS, three are labeled CRAWLER: YES, ANALYSIS: NO, and three are labeled USEFUL
  • 8. METHOD OVERVIEW: CLUSTERING
  • 9. CLUSTERING • “task of grouping a set of objects in such a way that objects in the same group are more similar (in some sense or the other) to each other than to those in the other groups” – Wikipedia • There are many ways to achieve this.
  • 10. HOW DO WE CLUSTER • Based on similarity between pages • Semantic similarity: meaning of the web pages (keywords, topics, …) • Syntactic similarity: web page structure, CSS styles • This presentation focuses on the syntactic aspect
  • 11. SIMILARITY CHECK • HTML ✓ • CSS ✓ • JavaScript ×
  • 12. METHOD : INPUT • Web pages from a crawler such as Apache Nutch
  • 13. METHOD : STEP #1 • Web pages from a crawler such as Apache Nutch → STRUCTURAL SIMILARITY
  • 14. STRUCTURAL SIMILARITY • Web pages are built with HTML • HTML document → DOM tree, a labeled ordered tree • Structural similarity is measured with tree edit distance (TED) • Example tree nodes: HTML, HEAD, BODY, TITLE, DIV, P
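The HTML-to-tree step on this slide could be sketched as follows. This is a minimal illustration using only Python's standard `html.parser`, not the authors' implementation; the names `Node`, `TreeBuilder`, `html_to_tree` and `to_brackets` are hypothetical, and the sample document only reuses the tag names listed on the slide (the exact shape of the slide's figure is not recoverable from the transcript).

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A labeled node whose children keep document order."""

    def __init__(self, label):
        self.label = label
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a labeled ordered tree of element names (text and attributes are ignored)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: add a leaf, keep the stack unchanged.
        self.stack[-1].children.append(Node(tag))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break


def html_to_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def to_brackets(node):
    """Serialize a tree in bracket notation, e.g. {html{head}{body}}."""
    return "{" + node.label + "".join(to_brackets(c) for c in node.children) + "}"


page = "<html><head><title>t</title></head><body><div></div><p>x</p></body></html>"
tree = html_to_tree(page)
print(to_brackets(tree.children[0]))  # {html{head{title}}{body{div}{p}}}
```

One limitation of this sketch: elements whose end tags are optional (an unclosed `<p>` or `<li>`, for instance) are not closed implicitly, so real-world pages would need a more forgiving parser.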
  • 15. MINIMUM TREE EDIT DISTANCE • An edit distance like the one used for strings, but defined on hierarchical data instead of sequences • The number of editing operations required to transform one tree into another • Three basic editing operations: INSERT, REMOVE and REPLACE • A useful measure of how similar (or dissimilar) two trees are
  • 16. MINIMUM TREE EDIT DISTANCE* • Edit operations (figure panels numbered 1 to 4) • Normalized distance * Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
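To make the tree edit distance concrete, here is a compact sketch in the style of the Zhang-Shasha dynamic program with unit costs for insert, remove and replace. Trees are written as `(label, [children])` tuples, which is a representation chosen for this example. The slide mentions a normalized distance but the transcript does not give its formula, so `normalized_similarity` below, which divides by the combined node count, is an assumption and not necessarily the authors' definition.

```python
def _postorder(tree):
    """Return (labels, lmld) in postorder; lmld[i] is the leftmost leaf under node i."""
    labels, lmld = [], []

    def visit(node):
        label, children = node
        first_leaf = None
        for child in children:
            leaf = visit(child)
            if first_leaf is None:
                first_leaf = leaf
        index = len(labels)
        labels.append(label)
        lmld.append(index if first_leaf is None else first_leaf)
        return lmld[index]

    visit(tree)
    return labels, lmld


def _keyroots(lmld):
    """For each distinct leftmost leaf, keep the highest node that has it."""
    last = {}
    for i, leaf in enumerate(lmld):
        last[leaf] = i
    return sorted(last.values())


def tree_edit_distance(t1, t2):
    a, al = _postorder(t1)
    b, bl = _postorder(t2)
    td = [[0] * len(b) for _ in a]  # td[p][q]: distance between subtrees p and q
    for i in _keyroots(al):
        for j in _keyroots(bl):
            li, lj = al[i], bl[j]
            rows, cols = i - li + 2, j - lj + 2
            # fd[x][y]: distance between the first x and first y nodes of the two forests
            fd = [[0] * cols for _ in range(rows)]
            for x in range(1, rows):
                fd[x][0] = fd[x - 1][0] + 1  # remove
            for y in range(1, cols):
                fd[0][y] = fd[0][y - 1] + 1  # insert
            for x in range(1, rows):
                for y in range(1, cols):
                    p, q = li + x - 1, lj + y - 1
                    if al[p] == li and bl[q] == lj:
                        # Both forests are whole trees: allow replace (cost 0 if labels match).
                        cost = 0 if a[p] == b[q] else 1
                        fd[x][y] = min(fd[x - 1][y] + 1, fd[x][y - 1] + 1,
                                       fd[x - 1][y - 1] + cost)
                        td[p][q] = fd[x][y]
                    else:
                        # Reuse the already computed distance between subtrees p and q.
                        fd[x][y] = min(fd[x - 1][y] + 1, fd[x][y - 1] + 1,
                                       fd[al[p] - li][bl[q] - lj] + td[p][q])
    return td[-1][-1]


def normalized_similarity(t1, t2):
    """Assumed normalization: 1 - distance / (|T1| + |T2|), which stays in [0, 1]."""
    size = len(_postorder(t1)[0]) + len(_postorder(t2)[0])
    return 1.0 - tree_edit_distance(t1, t2) / size


base = ("html", [("head", [("title", [])]), ("body", [("div", []), ("p", [])])])
renamed = ("html", [("head", [("title", [])]), ("body", [("div", []), ("span", [])])])
removed = ("html", [("head", [("title", [])]), ("body", [("p", [])])])
print(tree_edit_distance(base, renamed))  # 1 (one replace)
print(tree_edit_distance(base, removed))  # 1 (one remove)
```

The recursive postorder walk keeps the sketch short; very deep trees would need an iterative traversal.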
  • 17. METHOD : STEP #2 • Web pages from a crawler such as Apache Nutch → STYLE SIMILARITY
  • 18. STYLE SIMILARITY • Similar web pages have similar CSS styles • XPath: "//*[@class]/@class" • Simple measure: Jaccard similarity on CSS class names
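The style measure can be illustrated with a short sketch. Instead of evaluating the XPath directly, it collects every `class` attribute with Python's standard `html.parser`, which gathers the same values; splitting each attribute on whitespace to get individual class names, and scoring two class-free pages as 1.0, are assumptions of this example. The helper names and the two sample pages are hypothetical.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects the CSS class names used anywhere in a page."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags by the default handle_startendtag.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical (assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


page_a = '<div class="ad listing"><span class="price">$1</span></div>'
page_b = '<div class="ad detail"><span class="price">$2</span></div>'
print(jaccard(css_classes(page_a), css_classes(page_b)))  # 0.5
```

Here the pages share `ad` and `price` out of four distinct class names, hence 2 / 4.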
  • 19. METHOD : STEP #3 • AGGREGATED = k × STRUCTURAL + (1 - k) × STYLE
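As a worked example of the weighted combination, the sketch below applies the formula to single scores and to whole similarity matrices. The function names and the sample scores are hypothetical; it assumes both inputs are similarities in [0, 1] and that k lies in [0, 1].

```python
def aggregate_similarity(structural, style, k=0.5):
    """k * structural + (1 - k) * style, for scores in [0, 1]."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return k * structural + (1.0 - k) * style


def aggregate_matrix(structural, style, k=0.5):
    """Apply the same weighting cell by cell to two equally sized matrices."""
    return [[aggregate_similarity(s, t, k) for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(structural, style)]


# Made-up scores: structure 0.8, style 0.6, equal weights.
print(round(aggregate_similarity(0.8, 0.6, k=0.5), 3))  # 0.7
```

With k = 1 only structure counts and with k = 0 only style counts; the evaluation slide later in the deck uses equal weights.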
  • 20. METHOD : STEP #4 • SIMILARITY MATRIX → CLUSTERING (SHARED NEAR NEIGHBOR) → CLUSTERS
  • 21. SHARED NEAR NEIGHBOR (SNN) ALGORITHM • “If two data points share a threshold number of neighbors, then they must belong to the same cluster” * • Figure label: Web Pages * Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.
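A rough sketch of shared-near-neighbor clustering over a similarity matrix follows. It is one possible reading of the quoted rule, not the authors' code: a page's neighbors are taken to be all pages at or above a similarity threshold, two mutually neighboring pages are linked when the shared fraction of their combined neighborhoods reaches a second threshold, and clusters are the connected groups of links. Both threshold definitions, the function name and the toy matrix are assumptions.

```python
def snn_clusters(sim, sim_threshold=0.9, shared_threshold=0.9):
    """Cluster items 0..n-1 given a symmetric similarity matrix `sim`."""
    n = len(sim)
    # Neighborhood of i: every item similar enough to i, including i itself.
    neighbors = [{j for j in range(n) if sim[i][j] >= sim_threshold} | {i}
                 for i in range(n)]

    parent = list(range(n))  # union-find forest for merging linked items

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if j not in neighbors[i] or i not in neighbors[j]:
                continue
            shared = len(neighbors[i] & neighbors[j])
            combined = len(neighbors[i] | neighbors[j])
            if shared / combined >= shared_threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy matrix: pages 0-2 resemble each other, as do pages 3-4.
high, low = 0.95, 0.2
sim = [[1.0 if i == j else (high if (i < 3) == (j < 3) else low)
        for j in range(5)] for i in range(5)]
print(snn_clusters(sim))  # [[0, 1, 2], [3, 4]]
```

Note that no cluster count is supplied: the number of groups falls out of the two thresholds, which is the property the next slide highlights.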
  • 22. WHAT’S GOOD ABOUT SNN ALGORITHM • Guessing k in k-means is hard; a more meaningful request is “Make clusters of 90% similarity” instead of “Make 10 clusters” • Mean / average of documents in a cluster? • Average of DOM trees? • Average of CSS styles? • Circular / spherical / globular shapes?
  • 23. METHOD : LAST STEP* • CLUSTERS → LABELING → CATEGORIES / USABLE CLUSTERS
  • 24. METHOD : LAST STEP* • CLUSTERS → LABELING → CATEGORIES / USABLE CLUSTERS * Human intervention: this step requires domain expertise
  • 25. SOME APPLICATIONS? • Separate the interesting web pages • Drop uninteresting/noisy web pages • Categorical treatment of clusters • Extract structured data using XPath • Automated extraction using alignment
  • 26. WORKFLOW: PART #1
  • 27. WORKFLOW: PART #2
  • 28. EVALUATION • DATASET: 1310 web pages from http://armslist.com • 987 ad detail pages • 311 ad listing pages • 12 others (index, contact, FAQs, etc.) • PARAMETERS: • 50% weight for CSS style, 50% weight for HTML structure • Series of experiments at various thresholds: 85%, 90%, 95%
  • 29. EVALUATION • PARAMETERS: SIMILARITY = 90%, SHARED NEIGHBORS = 90%
  • 30. EVALUATION • PARAMETERS: SIMILARITY = 95%, SHARED NEIGHBORS = 95%
  • 31. EVALUATION • PARAMETERS: SIMILARITY = 85%, SHARED NEIGHBORS = 85%
  • 32. CHALLENGES • TED is very expensive • Zhang-Shasha’s TED: O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)}) • That’s O(n^4) • Approx. 1000 HTML tags • That’s O(10^12) • Plot: time complexity vs. number of HTML tags
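The arithmetic behind these figures can be checked with a few lines. The sketch evaluates the bound quoted on the slide, first in the simplified n^4 form with n = 1000 and then for a hypothetical shallow page; the depth of 20 and the 600 leaves are made-up values for illustration. The pair count assumes every pair of the 1310 pages in the evaluation dataset is compared once.

```python
def zhang_shasha_bound(size1, depth1, leaves1, size2, depth2, leaves2):
    """|T1| * |T2| * min(depth(T1), leaves(T1)) * min(depth(T2), leaves(T2))."""
    return size1 * size2 * min(depth1, leaves1) * min(depth2, leaves2)


n = 1000                      # approximate number of HTML tags per page (from the slide)
worst_case = n ** 4           # the simplified O(n^4) figure
print(worst_case)             # 1000000000000, i.e. 10^12

# Hypothetical shallow page: 1000 tags, depth 20, 600 leaves.
shallow = zhang_shasha_bound(1000, 20, 600, 1000, 20, 600)
print(shallow)                # 400000000

pages = 1310                  # dataset size from the evaluation slide
pairs = pages * (pages - 1) // 2
print(pairs)                  # 857395 pairwise comparisons
```

The gap between the two per-pair figures shows that the bound depends heavily on tree shape, and the pair count shows why the per-pair cost matters at all.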
  • 33. ACKNOWLEDGMENTS • DARPA MEMEX * Photo credits: http://memex.jpl.nasa.gov/
  • 34. THANK YOU • Source code: https://github.com/USCDataScience/autoextractor • Tutorial: https://git.io/vwS69 • Follow up: Thamme Gowda (@thammegowda), Chris Mattmann (@chrismattmann)

Editor's Notes

  1. Base version. Three variants illustrate the three different operations. Count all these operations -