July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA
Thamme Gowda
@thammegowda
Dr. Chris Mattmann
@chrismattmann
CLUSTERING WEB PAGES BASED ON
STRUCTURE AND STYLE SIMILARITY
Information Retrieval
and Data Science
OUTLINE
• Problem Statement
• Method Overview
• Steps
• Tree Edit Distance
• Style Similarity
• Shared Near Neighbor Clustering
• Evaluation
• Challenges
PROBLEM STATEMENT
• Scraping data from online marketplaces
• Start with the homepage → categories → listings → the actual content (detail pages)
SAMPLE WEB PAGES
Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
[Figure: eight sample web pages, numbered 1 to 8. Over four builds of the slide, two pages are marked USELESS, three are marked CRAWLER: YES / ANALYSIS: NO, and three are marked USEFUL.]
METHOD OVERVIEW
CLUSTERING
• “task of grouping a set of objects in such a way that objects
in the same group are more similar (in some sense or the
other) to each other than to those in the other groups”
– Wikipedia
• There are many ways to achieve this.
HOW DO WE CLUSTER
• Based on similarity between pages
• Semantic similarity
• meaning of the web pages (keywords, topics,…)
• Syntactic similarity
• Web page structure, CSS styles
• This presentation focuses on the syntactic aspect
• HTML ✓
• CSS ✓
• JavaScript ×
SIMILARITY CHECK
METHOD : INPUT
WEB PAGES FROM CRAWLER
LIKE APACHE NUTCH
METHOD : STEP #1
[Diagram: WEB PAGES FROM CRAWLER LIKE APACHE NUTCH, STRUCTURAL SIMILARITY]
STRUCTURAL SIMILARITY
• Web pages are built with HTML
• HTML document → DOM tree
• A labeled, ordered tree
• Structural similarity using tree edit distance (TED)
[Figure: example DOM tree with nodes HTML, HEAD, BODY, TITLE, DIV, P]
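The "HTML document → DOM tree" step can be sketched with Python's standard `html.parser`. This is a minimal illustration, not the authors' implementation: the `[tag, children]` node layout, the synthetic `#document` root, and the helper names `dom_tree` and `outline` are assumptions made for this example, and text and attributes are ignored.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they cannot have children.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomBuilder(HTMLParser):
    """Builds a labeled ordered tree of [tag, children] nodes."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]   # synthetic root (assumption of this sketch)
        self.stack = [self.root]        # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)  # children keep document order
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def dom_tree(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Returns the tree as indented lines, one per node."""
    lines = ["  " * depth + node[0]]
    for child in node[1]:
        lines.extend(outline(child, depth + 1))
    return lines


page = "<html><head><title>Ad</title></head><body><div></div><p></p></body></html>"
print("\n".join(outline(dom_tree(page))))
```

The labels are the tag names and the child order is the document order, which is what makes the result a labeled ordered tree.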
MINIMUM TREE EDIT DISTANCE
• An edit distance like the one for strings, but on hierarchical data instead of sequences
• The number of editing operations required to transform one tree into another
• Three basic editing operations: INSERT, REMOVE and REPLACE
• A useful measure to quantify how similar (or dissimilar) two trees are
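The definition above can be sketched as a memoized recursion over ordered forests with unit costs for INSERT, REMOVE and REPLACE. This is not the Zhang-Shasha algorithm itself: it computes the same kind of distance for small trees but without that algorithm's efficiency. The tuple tree layout, the helper names, and the normalization by total node count are all assumptions of this example; the slides do not say how the distance is normalized.

```python
from functools import lru_cache


def node(label, *children):
    """A tree is (label, tuple_of_child_trees); tuples keep it hashable."""
    return (label, tuple(children))


def size(tree):
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not g:   # remove the rightmost root of f; its children move up
        return forest_distance(f[:-1] + f[-1][1], g) + 1
    if not f:   # insert the rightmost root of g
        return forest_distance(f, g[:-1] + g[-1][1]) + 1
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    return min(
        forest_distance(f[:-1] + f_kids, g) + 1,   # REMOVE
        forest_distance(f, g[:-1] + g_kids) + 1,   # INSERT
        forest_distance(f_kids, g_kids)            # REPLACE (free if labels match)
        + forest_distance(f[:-1], g[:-1])
        + (f_label != g_label),
    )


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def structural_similarity(t1, t2):
    # Assumed normalization: divide by the total number of nodes in both trees.
    return 1.0 - tree_edit_distance(t1, t2) / (size(t1) + size(t2))


a = node("html", node("head", node("title")), node("body", node("div"), node("p")))
b = node("html", node("head", node("title")), node("body", node("div"), node("span")))
c = node("html", node("head", node("title")), node("body", node("div")))
print(tree_edit_distance(a, b))   # one REPLACE: p -> span
print(tree_edit_distance(a, c))   # one REMOVE: p
print(structural_similarity(a, a))
```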
MINIMUM TREE EDIT DISTANCE*
● Edit operations
● Normalized distance
* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
METHOD : STEP #2
[Diagram: WEB PAGES FROM CRAWLER LIKE APACHE NUTCH, STYLE SIMILARITY]
STYLE SIMILARITY
• Similar web pages have similar CSS styles
• XPath: "//*[@class]/@class"
• Simple measure: Jaccard similarity on CSS class names
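A minimal sketch of this measure, using only the Python standard library. Instead of evaluating the XPath directly, it collects every `class` attribute value with `html.parser`, which selects the same information. The sample pages, their class names, and the helper names are made up for illustration, and treating two pages with no classes as identical is an assumption.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects the CSS class names used anywhere in a page."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())  # class="a b" -> {"a", "b"}


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical (assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical pages: they share "ad" and "title" and differ in one class each.
page_a = '<div class="ad title"><p class="price">$10</p></div>'
page_b = '<div class="ad title"><span class="seller">Bob</span></div>'
print(jaccard(css_classes(page_a), css_classes(page_b)))  # 2 shared / 4 total = 0.5
```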
METHOD : STEP #3
AGGREGATED = k · STRUCTURAL + (1 - k) · STYLE
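The weighted combination can be sketched as below. The function names are assumptions of this example; the default k = 0.5 mirrors the 50/50 weighting reported in the evaluation parameters, and both inputs are assumed to be similarities in the range 0 to 1.

```python
def aggregate(structural, style, k=0.5):
    """Weighted mix of two similarity scores; k is the weight of the structural part."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1.0 - k) * style


def aggregate_matrix(structural, style, k=0.5):
    """Applies the same mix cell by cell to two equally sized similarity matrices."""
    return [
        [aggregate(s, t, k) for s, t in zip(row_s, row_t)]
        for row_s, row_t in zip(structural, style)
    ]


print(aggregate(0.75, 0.5))          # equal weights
print(aggregate(0.75, 0.5, k=1.0))   # structure only
```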
METHOD : STEP #4
SIMILARITY MATRIX → CLUSTERING (SHARED NEAR NEIGHBOR) → CLUSTERS
SHARED NEAR NEIGHBOR (SNN) ALGORITHM
“If two data points share a threshold number of neighbors, then they must belong to the same cluster” *
* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.
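One way to read the quoted rule is sketched below. It is a simplified variant, not necessarily what the authors implemented: neighbors are defined by a similarity threshold rather than by a fixed number of nearest neighbors, and "shared neighbors" is interpreted as a fraction of the smaller neighborhood. Both choices, the function name, and the example matrix are assumptions. The input is assumed to be a symmetric similarity matrix with 1.0 on the diagonal.

```python
def snn_clusters(similarity, sim_threshold, share_threshold):
    """Groups items that are neighbors and share enough of their neighbors."""
    n = len(similarity)
    # Neighbors of i: every item (including i itself) at least sim_threshold similar.
    neighbors = [
        {j for j in range(n) if similarity[i][j] >= sim_threshold}
        for i in range(n)
    ]
    parent = list(range(n))   # union-find forest over item indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if j not in neighbors[i]:
                continue
            shared = len(neighbors[i] & neighbors[j])
            smaller = min(len(neighbors[i]), len(neighbors[j]))
            if shared >= share_threshold * smaller:
                parent[find(i)] = find(j)   # same cluster

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical similarity matrix for five pages.
sim = [
    [1.00, 0.95, 0.93, 0.30, 0.20],
    [0.95, 1.00, 0.96, 0.30, 0.20],
    [0.93, 0.96, 1.00, 0.30, 0.20],
    [0.30, 0.30, 0.30, 1.00, 0.92],
    [0.20, 0.20, 0.20, 0.92, 1.00],
]
print(snn_clusters(sim, sim_threshold=0.90, share_threshold=0.90))
```

Note that the number of clusters is never specified: it falls out of the two thresholds.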
WHAT’S GOOD ABOUT THE SNN ALGORITHM
• Guessing k in k-means is hard
• A more meaningful request: “Make clusters of 90% similarity” instead of “Make 10 clusters”
• What is the mean / average of the documents in a cluster?
• Average of DOM trees?
• Average of CSS styles?
• Circular / spherical / globular cluster shapes?
METHOD : LAST STEP*
CLUSTERS → LABELING → CATEGORIES / USABLE CLUSTERS
* HUMAN INTERVENTION - THIS STEP REQUIRES DOMAIN EXPERTISE
SOME APPLICATIONS?
• Separate the interesting web pages?
• Drop uninteresting/noisy web pages
• Categorical treatment of clusters
• Extract structured data using XPath
• Automated extraction using alignment
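As an illustration of XPath-based extraction once a cluster has been labeled, the sketch below applies one XPath per field to a page. Everything specific here is hypothetical: the page markup, the field names, and the class names. It uses the limited XPath subset in Python's `xml.etree.ElementTree`, which requires well-formed markup; real crawled HTML would need a tolerant parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed detail page.
page = """
<html><body>
  <h1 class="title">Sample listing</h1>
  <span class="price">100</span>
  <div class="location">Pittsburgh, PA</div>
</body></html>
"""

# One XPath per field, written once for the whole cluster (names are made up).
FIELD_XPATHS = {
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
    "location": ".//div[@class='location']",
}


def extract(xhtml, field_xpaths):
    """Returns {field: text} for the first element matching each XPath."""
    root = ET.fromstring(xhtml)
    record = {}
    for field, xpath in field_xpaths.items():
        element = root.find(xpath)
        has_text = element is not None and element.text
        record[field] = element.text.strip() if has_text else None
    return record


print(extract(page, FIELD_XPATHS))
```

Because pages in one cluster share structure and style, the same set of expressions can be expected to work across the cluster, which is why clustering comes before extraction.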
WORKFLOW: PART #1
WORKFLOW: PART #2
EVALUATION
DATASET:
1310 web pages from http://armslist.com
• 987 ad detail pages
• 311 ad listing pages
• 12 others: index, contact, FAQs, etc.
PARAMETERS:
• 50% weight for CSS style, 50% weight for HTML structure
• Series of experiments at various thresholds: 85%, 90%, 95%
EVALUATION
PARAMETERS: SIMILARITY = 90%, SHARED NEIGHBORS = 90%
EVALUATION
PARAMETERS: SIMILARITY = 95%, SHARED NEIGHBORS = 95%
EVALUATION
PARAMETERS: SIMILARITY = 85%, SHARED NEIGHBORS = 85%
CHALLENGES
• TED is very expensive
• Zhang-Shasha’s TED runs in O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)})
• That’s O(n^4) in the worst case
• Approx. 1000 HTML tags per page
• That’s on the order of 10^12 operations per pair of pages
[Figure: time complexity vs. number of HTML tags]
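The arithmetic behind this estimate can be checked in a few lines. The per-pair figure uses the worst-case O(n^4) bound with n ≈ 1000 tags from the slide. The pair count assumes that every pair among the 1310 evaluation pages is compared, which the slides do not state explicitly.

```python
from math import comb

tags_per_page = 1000    # "Approx. 1000 HTML tags" from the slide
pages = 1310            # size of the evaluation dataset

ops_per_pair = tags_per_page ** 4   # worst case, O(n^4)
page_pairs = comb(pages, 2)         # unordered pairs in a full similarity matrix

print(f"{ops_per_pair:.0e} operations per pair (worst-case bound)")
print(f"{page_pairs} page pairs")
```

These are upper bounds, not measured running times. They show why the tree edit distance step dominates the cost of the pipeline.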
ACKNOWLEDGMENTS
DARPA MEMEX
* Photo Credits : http://memex.jpl.nasa.gov/
• Source Code
https://github.com/USCDataScience/autoextractor
• Tutorial
https://git.io/vwS69
• Follow up
• Thamme Gowda - @thammegowda
• Chris Mattmann - @chrismattmann
THANK YOU

More Related Content

What's hot

Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
Marina Santini
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
EDB
 
YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)
YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)
YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)
NTT DATA Technology & Innovation
 
スキーマレスカラムナフォーマット「Yosegi」で実現する スキーマの柔軟性と処理性能を両立したログ収集システム / Hadoop / Spark Con...
スキーマレスカラムナフォーマット「Yosegi」で実現する スキーマの柔軟性と処理性能を両立したログ収集システム / Hadoop / Spark Con...スキーマレスカラムナフォーマット「Yosegi」で実現する スキーマの柔軟性と処理性能を両立したログ収集システム / Hadoop / Spark Con...
スキーマレスカラムナフォーマット「Yosegi」で実現する スキーマの柔軟性と処理性能を両立したログ収集システム / Hadoop / Spark Con...
Yahoo!デベロッパーネットワーク
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
MongoDB World 2019: The Journey of Migration from Oracle to MongoDB at Rakuten
MongoDB World 2019: The Journey of Migration from Oracle to MongoDB at RakutenMongoDB World 2019: The Journey of Migration from Oracle to MongoDB at Rakuten
MongoDB World 2019: The Journey of Migration from Oracle to MongoDB at Rakuten
MongoDB
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
Spark Summit
 
U-Net (1).pptx
U-Net (1).pptxU-Net (1).pptx
U-Net (1).pptx
Changjin Lee
 
PGroonga – Make PostgreSQL fast full text search platform for all languages!
PGroonga – Make PostgreSQL fast full text search platform for all languages!PGroonga – Make PostgreSQL fast full text search platform for all languages!
PGroonga – Make PostgreSQL fast full text search platform for all languages!
Kouhei Sutou
 
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
Amazon Web Services
 
Clickstream analytics with Markov Chains
Clickstream analytics with Markov ChainsClickstream analytics with Markov Chains
Clickstream analytics with Markov Chains
Alex Papageorgiou
 
Google Dataflow Intro
Google Dataflow IntroGoogle Dataflow Intro
Google Dataflow Intro
Ivan Glushkov
 
PostgreSQLの範囲型と排他制約
PostgreSQLの範囲型と排他制約PostgreSQLの範囲型と排他制約
PostgreSQLの範囲型と排他制約Akio Ishida
 
What is data engineering?
What is data engineering?What is data engineering?
What is data engineering?
yongdam kim
 
From Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFiFrom Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFi
DataWorks Summit/Hadoop Summit
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
Daniel Hen
 
Data reduction
Data reductionData reduction
Data reduction
kalavathisugan
 
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph DatabaseGraph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
Connected Data World
 
Logをs3とredshiftに格納する仕組み
Logをs3とredshiftに格納する仕組みLogをs3とredshiftに格納する仕組み
Logをs3とredshiftに格納する仕組みKen Morishita
 

What's hot (20)

Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)
YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)
YugabyteDBを使ってみよう(NewSQL/分散SQLデータベースよろず勉強会 #1 発表資料)
 
スキーマレスカラムナフォーマット「Yosegi」で実現する スキーマの柔軟性と処理性能を両立したログ収集システム / Hadoop / Spark Con...
スキーマレスカラムナフォーマット「Yosegi」で実現する スキーマの柔軟性と処理性能を両立したログ収集システム / Hadoop / Spark Con...スキーマレスカラムナフォーマット「Yosegi」で実現する スキーマの柔軟性と処理性能を両立したログ収集システム / Hadoop / Spark Con...
スキーマレスカラムナフォーマット「Yosegi」で実現する スキーマの柔軟性と処理性能を両立したログ収集システム / Hadoop / Spark Con...
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
MongoDB World 2019: The Journey of Migration from Oracle to MongoDB at Rakuten
MongoDB World 2019: The Journey of Migration from Oracle to MongoDB at RakutenMongoDB World 2019: The Journey of Migration from Oracle to MongoDB at Rakuten
MongoDB World 2019: The Journey of Migration from Oracle to MongoDB at Rakuten
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
U-Net (1).pptx
U-Net (1).pptxU-Net (1).pptx
U-Net (1).pptx
 
PGroonga – Make PostgreSQL fast full text search platform for all languages!
PGroonga – Make PostgreSQL fast full text search platform for all languages!PGroonga – Make PostgreSQL fast full text search platform for all languages!
PGroonga – Make PostgreSQL fast full text search platform for all languages!
 
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
 
Clickstream analytics with Markov Chains
Clickstream analytics with Markov ChainsClickstream analytics with Markov Chains
Clickstream analytics with Markov Chains
 
Google Dataflow Intro
Google Dataflow IntroGoogle Dataflow Intro
Google Dataflow Intro
 
PostgreSQLの範囲型と排他制約
PostgreSQLの範囲型と排他制約PostgreSQLの範囲型と排他制約
PostgreSQLの範囲型と排他制約
 
What is data engineering?
What is data engineering?What is data engineering?
What is data engineering?
 
From Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFiFrom Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFi
 
Firebirdの障害対策
Firebirdの障害対策Firebirdの障害対策
Firebirdの障害対策
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
 
Data reduction
Data reductionData reduction
Data reduction
 
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph DatabaseGraph in Apache Cassandra. The World’s Most Scalable Graph Database
Graph in Apache Cassandra. The World’s Most Scalable Graph Database
 
Logをs3とredshiftに格納する仕組み
Logをs3とredshiftに格納する仕組みLogをs3とredshiftに格納する仕組み
Logをs3とredshiftに格納する仕組み
 

Viewers also liked

College Profile
College ProfileCollege Profile
College Profile
IPGenius Inc.
 
318157119 the-village
318157119 the-village318157119 the-village
318157119 the-village
hayat alishah
 
2008 utpp po2
2008 utpp po22008 utpp po2
2008 utpp po2
hayat alishah
 
Bl fam 239011
Bl fam 239011Bl fam 239011
Bl fam 239011
hayat alishah
 
Лид скоринг или Цикл принятия решения о покупке в b2b-сегменте
Лид скоринг или Цикл принятия решения о покупке в b2b-сегментеЛид скоринг или Цикл принятия решения о покупке в b2b-сегменте
Лид скоринг или Цикл принятия решения о покупке в b2b-сегменте
Vladyslava Rykova
 
Session 6-1-john-laidlow-responsible-investment-as-a-tool-to-guide-sustainabl...
Session 6-1-john-laidlow-responsible-investment-as-a-tool-to-guide-sustainabl...Session 6-1-john-laidlow-responsible-investment-as-a-tool-to-guide-sustainabl...
Session 6-1-john-laidlow-responsible-investment-as-a-tool-to-guide-sustainabl...
ZSL Biodiversity & Palm Oil Platform
 
Session 3-5-moray-mcleish-directing-oil-palm-expansion-onto-degraded-land-1471
Session 3-5-moray-mcleish-directing-oil-palm-expansion-onto-degraded-land-1471Session 3-5-moray-mcleish-directing-oil-palm-expansion-onto-degraded-land-1471
Session 3-5-moray-mcleish-directing-oil-palm-expansion-onto-degraded-land-1471
ZSL Biodiversity & Palm Oil Platform
 
Session 6-5-chen-ying-prospects-and-challenges-for-sustainable-palm-oil-in-ch...
Session 6-5-chen-ying-prospects-and-challenges-for-sustainable-palm-oil-in-ch...Session 6-5-chen-ying-prospects-and-challenges-for-sustainable-palm-oil-in-ch...
Session 6-5-chen-ying-prospects-and-challenges-for-sustainable-palm-oil-in-ch...
ZSL Biodiversity & Palm Oil Platform
 
FDHS Class of 1992 Over the Years
FDHS Class of 1992 Over the YearsFDHS Class of 1992 Over the Years
FDHS Class of 1992 Over the Yearsimbarefootin
 
Detector movemento cdm 180
Detector movemento cdm 180Detector movemento cdm 180
Detector movemento cdm 180
Xosé Manoel Álvarez López
 
The quest for the Entrepreneurial North Star
The quest for the Entrepreneurial North StarThe quest for the Entrepreneurial North Star
The quest for the Entrepreneurial North Star
Ruta Aidis
 
PR в Интернете. С чем будет иметь дело PR-менеджер в вебе?
PR в Интернете. С чем будет иметь дело PR-менеджер в вебе?PR в Интернете. С чем будет иметь дело PR-менеджер в вебе?
PR в Интернете. С чем будет иметь дело PR-менеджер в вебе?
Vladyslava Rykova
 
Global Social Media Statistics 2012
Global Social Media Statistics 2012Global Social Media Statistics 2012
Global Social Media Statistics 2012
Harsh Wardhan Dave
 
Клонирование интернет-магазинов. Сайты-аффилиаты
Клонирование интернет-магазинов. Сайты-аффилиатыКлонирование интернет-магазинов. Сайты-аффилиаты
Клонирование интернет-магазинов. Сайты-аффилиаты
Vladyslava Rykova
 
33 Ways to Save Money
33 Ways to Save Money33 Ways to Save Money
The Maine Trial Lawyers Association
The Maine Trial Lawyers AssociationThe Maine Trial Lawyers Association
The Maine Trial Lawyers AssociationHolmes Legal Group
 
Summerifeld84 lt
Summerifeld84 ltSummerifeld84 lt
Summerifeld84 lt
summerfield84
 
Q2 adp 2015-16 sectoral format for sports
Q2 adp 2015-16 sectoral format for sportsQ2 adp 2015-16 sectoral format for sports
Q2 adp 2015-16 sectoral format for sports
hayat alishah
 
What Mobile Users Want
What Mobile Users WantWhat Mobile Users Want
What Mobile Users Want
Harsh Wardhan Dave
 

Viewers also liked (20)

College Profile
College ProfileCollege Profile
College Profile
 
318157119 the-village
318157119 the-village318157119 the-village
318157119 the-village
 
2008 utpp po2
2008 utpp po22008 utpp po2
2008 utpp po2
 
Bl fam 239011
Bl fam 239011Bl fam 239011
Bl fam 239011
 
Лид скоринг или Цикл принятия решения о покупке в b2b-сегменте
Лид скоринг или Цикл принятия решения о покупке в b2b-сегментеЛид скоринг или Цикл принятия решения о покупке в b2b-сегменте
Лид скоринг или Цикл принятия решения о покупке в b2b-сегменте
 
Session 6-1-john-laidlow-responsible-investment-as-a-tool-to-guide-sustainabl...
Session 6-1-john-laidlow-responsible-investment-as-a-tool-to-guide-sustainabl...Session 6-1-john-laidlow-responsible-investment-as-a-tool-to-guide-sustainabl...
Session 6-1-john-laidlow-responsible-investment-as-a-tool-to-guide-sustainabl...
 
Session 3-5-moray-mcleish-directing-oil-palm-expansion-onto-degraded-land-1471
Session 3-5-moray-mcleish-directing-oil-palm-expansion-onto-degraded-land-1471Session 3-5-moray-mcleish-directing-oil-palm-expansion-onto-degraded-land-1471
Session 3-5-moray-mcleish-directing-oil-palm-expansion-onto-degraded-land-1471
 
Session 6-5-chen-ying-prospects-and-challenges-for-sustainable-palm-oil-in-ch...
Session 6-5-chen-ying-prospects-and-challenges-for-sustainable-palm-oil-in-ch...Session 6-5-chen-ying-prospects-and-challenges-for-sustainable-palm-oil-in-ch...
Session 6-5-chen-ying-prospects-and-challenges-for-sustainable-palm-oil-in-ch...
 
FDHS Class of 1992 Over the Years
FDHS Class of 1992 Over the YearsFDHS Class of 1992 Over the Years
FDHS Class of 1992 Over the Years
 
Истоки (2008 год)
Истоки (2008 год)Истоки (2008 год)
Истоки (2008 год)
 
Detector movemento cdm 180
Detector movemento cdm 180Detector movemento cdm 180
Detector movemento cdm 180
 
The quest for the Entrepreneurial North Star
The quest for the Entrepreneurial North StarThe quest for the Entrepreneurial North Star
The quest for the Entrepreneurial North Star
 
PR в Интернете. С чем будет иметь дело PR-менеджер в вебе?
PR в Интернете. С чем будет иметь дело PR-менеджер в вебе?PR в Интернете. С чем будет иметь дело PR-менеджер в вебе?
PR в Интернете. С чем будет иметь дело PR-менеджер в вебе?
 
Global Social Media Statistics 2012
Global Social Media Statistics 2012Global Social Media Statistics 2012
Global Social Media Statistics 2012
 
Клонирование интернет-магазинов. Сайты-аффилиаты
Клонирование интернет-магазинов. Сайты-аффилиатыКлонирование интернет-магазинов. Сайты-аффилиаты
Клонирование интернет-магазинов. Сайты-аффилиаты
 
33 Ways to Save Money
33 Ways to Save Money33 Ways to Save Money
33 Ways to Save Money
 
The Maine Trial Lawyers Association
The Maine Trial Lawyers AssociationThe Maine Trial Lawyers Association
The Maine Trial Lawyers Association
 
Summerifeld84 lt
Summerifeld84 ltSummerifeld84 lt
Summerifeld84 lt
 
Q2 adp 2015-16 sectoral format for sports
Q2 adp 2015-16 sectoral format for sportsQ2 adp 2015-16 sectoral format for sports
Q2 adp 2015-16 sectoral format for sports
 
What Mobile Users Want
What Mobile Users WantWhat Mobile Users Want
What Mobile Users Want
 

Similar to IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

Clustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache SparkClustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache Spark
Thamme Gowda
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slides
tafosepsdfasg
 
Project 0th Review
Project 0th ReviewProject 0th Review
Project 0th Review
Divakar Raj M
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1
malathieswaran29
 
Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...
Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...
Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...
Artem Chebotko
 
vdocuments.mx_chapter-2-database-environment-thomas-connolly-carolyn-begg-dat...
vdocuments.mx_chapter-2-database-environment-thomas-connolly-carolyn-begg-dat...vdocuments.mx_chapter-2-database-environment-thomas-connolly-carolyn-begg-dat...
vdocuments.mx_chapter-2-database-environment-thomas-connolly-carolyn-begg-dat...
adeel8937
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
eXascale Infolab
 
How DITA Got Her Groove Back: Going Mapless with Don Day
How DITA Got Her Groove Back: Going Mapless with Don DayHow DITA Got Her Groove Back: Going Mapless with Don Day
How DITA Got Her Groove Back: Going Mapless with Don Day
Information Development World
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
hktripathy
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
Richard Garris
 
Unit iii
Unit iiiUnit iii
Unit iii
Kgr Sushmitha
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
The Statistical and Applied Mathematical Sciences Institute
 
Lecture1
Lecture1Lecture1
Lecture1
Manish Singh
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
IdontKnow66967
 
Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...
Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...
Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...
ScyllaDB
 
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...
Anastasija Nikiforova
 
Analyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudAnalyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The Cloud
Robert Dempsey
 
dwdm unit 1.ppt
dwdm unit 1.pptdwdm unit 1.ppt
dwdm unit 1.ppt
nayanakarsh469
 
UNIT01-DBMS.ppt
UNIT01-DBMS.pptUNIT01-DBMS.ppt
UNIT01-DBMS.ppt
JacobDragonette
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
Dhilsath Fathima
 

Similar to IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity (20)

Clustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache SparkClustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache Spark
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slides
 
Project 0th Review
Project 0th ReviewProject 0th Review
Project 0th Review
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1
 
Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...
Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...
Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...
 
vdocuments.mx_chapter-2-database-environment-thomas-connolly-carolyn-begg-dat...
vdocuments.mx_chapter-2-database-environment-thomas-connolly-carolyn-begg-dat...vdocuments.mx_chapter-2-database-environment-thomas-connolly-carolyn-begg-dat...
vdocuments.mx_chapter-2-database-environment-thomas-connolly-carolyn-begg-dat...
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
How DITA Got Her Groove Back: Going Mapless with Don Day
How DITA Got Her Groove Back: Going Mapless with Don DayHow DITA Got Her Groove Back: Going Mapless with Don Day
How DITA Got Her Groove Back: Going Mapless with Don Day
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
Unit iii
Unit iiiUnit iii
Unit iii
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Lecture1
Lecture1Lecture1
Lecture1
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...
Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...
Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Pr...
 
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...
 
Analyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The CloudAnalyzing Semi-Structured Data At Volume In The Cloud
Analyzing Semi-Structured Data At Volume In The Cloud
 
dwdm unit 1.ppt
dwdm unit 1.pptdwdm unit 1.ppt
dwdm unit 1.ppt
 
UNIT01-DBMS.ppt
UNIT01-DBMS.pptUNIT01-DBMS.ppt
UNIT01-DBMS.ppt
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 

More from Thamme Gowda

Thamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slidesThamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slides
Thamme Gowda
 
Macro average: rare types are important too
Macro average: rare types are important tooMacro average: rare types are important too
Macro average: rare types are important too
Thamme Gowda
 
500 languages to English Machine Translation Model
500 languages to English Machine Translation Model500 languages to English Machine Translation Model
500 languages to English Machine Translation Model
Thamme Gowda
 
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Thamme Gowda
 
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRGData Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Thamme Gowda
 
Sparkler at spark summit east 2017
Sparkler at spark summit east 2017Sparkler at spark summit east 2017
Sparkler at spark summit east 2017
Thamme Gowda
 
Sparkler - Spark Crawler
Sparkler - Spark Crawler Sparkler - Spark Crawler
Sparkler - Spark Crawler
Thamme Gowda
 
Thamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL InternshipThamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda
 

More from Thamme Gowda (8)

Thamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slidesThamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slides
 
Macro average: rare types are important too
Macro average: rare types are important tooMacro average: rare types are important too
Macro average: rare types are important too
 
500 languages to English Machine Translation Model
500 languages to English Machine Translation Model500 languages to English Machine Translation Model
500 languages to English Machine Translation Model
 
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
 
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRGData Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
Data Programming: Creating Large Datasets, Quickly -- Presented at JPL MLRG
 
Sparkler at spark summit east 2017
Sparkler at spark summit east 2017Sparkler at spark summit east 2017
Sparkler at spark summit east 2017
 
Sparkler - Spark Crawler
Sparkler - Spark Crawler Sparkler - Spark Crawler
Sparkler - Spark Crawler
 
Thamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL InternshipThamme Gowda's Summer2016- NASA JPL Internship
Thamme Gowda's Summer2016- NASA JPL Internship
 



IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity

  • 1. CLUSTERING WEB PAGES BASED ON STRUCTURE AND STYLE SIMILARITY • Thamme Gowda (@thammegowda), Dr. Chris Mattmann (@chrismattmann) • July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA • Information Retrieval and Data Science
  • 2. OUTLINE • Problem Statement • Method Overview • Steps • Tree Edit Distance • Style Similarity • Shared Near Neighbor Clustering • Evaluation • Challenges
  • 3. PROBLEM STATEMENT • Scraping data from online marketplaces • Start with homepage → categories → listing → actual stuff (detail page)
  • 4. SAMPLE WEB PAGES • [Figure: eight sample web pages, numbered 1 to 8] • Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
  • 5. SAMPLE WEB PAGES • [Figure: the same eight pages, two of them marked USELESS] • Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
  • 6. SAMPLE WEB PAGES • [Figure: the same eight pages, two marked USELESS and three marked CRAWLER: YES, ANALYSIS: NO] • Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
  • 7. SAMPLE WEB PAGES • [Figure: the same eight pages, two marked USELESS, three marked CRAWLER: YES, ANALYSIS: NO, and three marked USEFUL] • Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
  • 8. METHOD OVERVIEW • CLUSTERING
  • 9. CLUSTERING • “task of grouping a set of objects in such a way that objects in the same group are more similar (in some sense or the other) to each other than to those in the other groups” – Wikipedia • There are many ways to achieve this.
  • 10. HOW DO WE CLUSTER • Based on similarity between pages • Semantic similarity: meaning of the web pages (keywords, topics, …) • Syntactic similarity: web page structure, CSS styles • This presentation focuses on the syntactic aspect
  • 11. SIMILARITY CHECK • HTML ✓ • CSS ✓ • JavaScript ×
  • 12. METHOD : INPUT • WEB PAGES FROM CRAWLER LIKE APACHE NUTCH
  • 13. METHOD : STEP #1 • WEB PAGES FROM CRAWLER LIKE APACHE NUTCH → STRUCTURAL SIMILARITY
  • 14. STRUCTURAL SIMILARITY • Web pages are built with HTML • HTML Doc → DOM tree • a labeled ordered tree • Structural similarity using tree edit distance (TED) • [Figure: example DOM tree with nodes HTML, HEAD, BODY, TITLE, DIV, P]
  • 15. MINIMUM TREE EDIT DISTANCE • Edit distance measure similar to the one for strings, but on hierarchical data instead of sequences • Number of editing operations required to transform one tree into another • Three basic editing operations: INSERT, REMOVE and REPLACE • A useful measure to quantify how similar (or dissimilar) two trees are
  • 16. MINIMUM TREE EDIT DISTANCE* • Edit operations • Normalized distance • [Figure: four example trees, numbered 1 to 4] • * Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
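
The edit operations and normalized distance on slides 15 and 16 can be illustrated with a small sketch. This is not the Zhang-Shasha algorithm cited on the slide; it is the plain rightmost-root recursion for ordered trees with unit costs, memoized, and only practical for tiny trees. The nested-tuple tree encoding, the function names, and the normalization by the sum of the two tree sizes are all assumptions made for illustration; the slides do not give the normalization formula.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, children), children is a tuple of trees.


def size(tree):
    """Number of nodes in a tree."""
    _label, children = tree
    return 1 + sum(size(child) for child in children)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return sum(size(t) for t in f2)   # insert everything in f2
    if not f2:
        return sum(size(t) for t in f1)   # remove everything in f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # REMOVE the rightmost root of f1: its children take its place.
    remove = forest_distance(f1[:-1] + kids1, f2) + 1
    # INSERT the rightmost root of f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # REPLACE (or keep) the root label, then match children and the rest.
    replace = (forest_distance(kids1, kids2)
               + forest_distance(f1[:-1], f2[:-1])
               + (0 if label1 == label2 else 1))
    return min(remove, insert, replace)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def structural_similarity(t1, t2):
    """Assumed normalization: 1 - distance / (|T1| + |T2|)."""
    return 1.0 - tree_edit_distance(t1, t2) / (size(t1) + size(t2))


# The DOM tree from slide 14, plus two hypothetical variants.
base = ("html", (("head", (("title", ()),)),
                 ("body", (("div", ()), ("p", ())))))
relabeled = ("html", (("head", (("title", ()),)),
                      ("body", (("div", ()), ("span", ())))))
pruned = ("html", (("head", (("title", ()),)),
                   ("body", (("p", ()),))))
```

Under this encoding, `relabeled` differs from `base` by one REPLACE and `pruned` by one REMOVE, so each is at distance 1 from `base`.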
  • 17. METHOD : STEP #2 • WEB PAGES FROM CRAWLER LIKE APACHE NUTCH → STYLE SIMILARITY
  • 18. STYLE SIMILARITY • Similar web pages have similar CSS styles • XPath: "//*[@class]/@class" • Simple measure: Jaccard similarity on CSS class names
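
A minimal sketch of the style measure on slide 18, assuming the class names are collected from every element's `class` attribute (the same values the XPath `//*[@class]/@class` selects) and split on whitespace. It uses Python's standard `html.parser` rather than an XPath engine, and the helper names and the two sample pages are hypothetical. Treating two pages with no classes at all as fully similar is also an assumption.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in a class attribute."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # One attribute may list several space-separated class names.
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def style_similarity(html_a, html_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| on the two class-name sets."""
    a, b = css_classes(html_a), css_classes(html_b)
    if not a and not b:
        return 1.0  # assumption: two unstyled pages count as identical in style
    return len(a & b) / len(a | b)


# Hypothetical sample pages.
page_a = '<div class="ad title"><p class="price">$10</p></div>'
page_b = '<div class="ad"><span class="price big">$12</span></div>'
```

Here the two pages share `ad` and `price` out of four distinct class names, so the similarity is 2/4.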
  • 19. METHOD : STEP #3 • AGGREGATED = k × STRUCTURAL + (1 − k) × STYLE
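
The weighted combination on slide 19 is a one-line function. The sketch below assumes both inputs are similarities in [0, 1]; the function name is hypothetical, and the default k = 0.5 mirrors the 50/50 weighting reported in the evaluation slides.

```python
def aggregate_similarity(structural, style, k=0.5):
    """AGGREGATED = k * STRUCTURAL + (1 - k) * STYLE, with k in [0, 1]."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return k * structural + (1.0 - k) * style
```

With k = 1 only the HTML structure counts, and with k = 0 only the CSS classes count.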
  • 20. METHOD : STEP #4 • SIMILARITY MATRIX → CLUSTERING (SHARED NEAR NEIGHBOR) → CLUSTERS
  • 21. SHARED NEAR NEIGHBOR (SNN) ALGORITHM • “If two data points share a threshold number of neighbors, then they must belong to the same cluster” * • [Figure: web pages as data points] • * Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, 100(11), 1025-1034.
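
One way to read the shared-near-neighbor idea on slide 21 is sketched below. It is an illustrative variant, not the authors' implementation: a point's neighbors are taken to be all points whose similarity meets a threshold, two mutual neighbors are linked when the fraction of neighbors they share meets a second threshold, and clusters are the connected components of those links. The two thresholds loosely correspond to the SIMILARITY and SHARED NEIGHBORS percentages on the evaluation slides, but that mapping, the function name, and the sample matrix are assumptions.

```python
def snn_clusters(sim, sim_threshold, shared_threshold):
    """Cluster points given a symmetric similarity matrix (list of lists)."""
    n = len(sim)
    # Neighborhood of i: itself plus every point similar enough to it.
    neighbors = [
        {j for j in range(n) if j == i or sim[i][j] >= sim_threshold}
        for i in range(n)
    ]

    parent = list(range(n))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if j not in neighbors[i] or i not in neighbors[j]:
                continue  # only mutual neighbors can be linked
            shared = len(neighbors[i] & neighbors[j])
            total = len(neighbors[i] | neighbors[j])
            if shared / total >= shared_threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical matrix: pages 0-2 resemble each other, as do pages 3-4.
sample = [
    [1.00, 0.95, 0.95, 0.10, 0.10],
    [0.95, 1.00, 0.95, 0.10, 0.10],
    [0.95, 0.95, 1.00, 0.10, 0.10],
    [0.10, 0.10, 0.10, 1.00, 0.95],
    [0.10, 0.10, 0.10, 0.95, 1.00],
]
```

Note that the number of clusters is never supplied; it falls out of the two thresholds, which is the property slide 22 highlights.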
  • 22. WHAT’S GOOD ABOUT SNN ALGORITHM • Guessing k in k-means is hard • Meaningful question: “Make clusters of 90% similarity” instead of “Make 10 clusters” • Mean / average of documents in a cluster? • Average of DOM trees? • Average of CSS styles? • Circular / spherical / globular shapes?
  • 23. METHOD : LAST STEP* • CLUSTERS → LABELING → CATEGORIES / USABLE CLUSTERS
  • 24. METHOD : LAST STEP* • CLUSTERS → LABELING → CATEGORIES / USABLE CLUSTERS • * HUMAN INTERVENTION - THIS STEP REQUIRES DOMAIN EXPERTISE
  • 25. SOME APPLICATIONS? • Separate the interesting web pages? • Drop uninteresting/noisy web pages • Categorical treatment of clusters • Extract structured data using XPath • Automated extraction using alignment
  • 26. WORKFLOW: PART #1
  • 27. WORKFLOW: PART #2
  • 28. EVALUATION • DATASET: 1310 web pages from http://armslist.com • 987 ad detail pages • 311 ad listing pages • 12 others (index, contact, FAQs, etc.) • PARAMETERS: • 50% weight for CSS style, 50% weight for HTML structure • Series of experiments on various thresholds: 85%, 90%, 95%
  • 29. EVALUATION • PARAMETERS: SIMILARITY = 90%, SHARED NEIGHBORS = 90%
  • 30. EVALUATION • PARAMETERS: SIMILARITY = 95%, SHARED NEIGHBORS = 95%
  • 31. EVALUATION • PARAMETERS: SIMILARITY = 85%, SHARED NEIGHBORS = 85%
  • 32. CHALLENGES • TED is very expensive • Zhang-Shasha’s TED: O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)}) • That’s O(n^4) • Approx. 1000 HTML tags • That’s O(10^12) • [Plot: time complexity vs. number of HTML tags]
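
The arithmetic behind slide 32 can be written out as follows. This assumes both trees have about n = 1000 nodes and bounds each min term by n, which is the worst case. The pair count on the second line is an added illustration: it assumes every pair among the 1310 pages of the evaluation dataset is compared, which the slides do not state.

```latex
\[
|T_1| \cdot |T_2| \cdot \min\{\mathrm{depth}(T_1), \mathrm{leaves}(T_1)\}
      \cdot \min\{\mathrm{depth}(T_2), \mathrm{leaves}(T_2)\}
  \;\le\; n \cdot n \cdot n \cdot n = n^4 = (10^3)^4 = 10^{12}
\]
\[
\binom{1310}{2} = \frac{1310 \cdot 1309}{2} = 857{,}395 \ \text{page pairs}
\]
```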
  • 33. ACKNOWLEDGMENTS • DARPA MEMEX • * Photo Credits: http://memex.jpl.nasa.gov/
  • 34. THANK YOU • Source Code: https://github.com/USCDataScience/autoextractor • Tutorial: https://git.io/vwS69 • Follow up • Thamme Gowda - @thammegowda • Chris Mattmann - @chrismattmann

Editor's Notes

  1. Base version. Three variants to illustrate three different operations. Count all these operations -