SlideShare a Scribd company logo
1 of 24
Download to read offline
Effective Unsupervised Matching of
Product Titles with k-Combinations
and Permutations
Leonidas Akritidis, Panayiotis Bozanis
Department of Electrical and Computer Engineering
University of Thessaly, Greece
L. Akritidis, P. Bozanis 1IEEE INISTA 2018
The problem (1)
• We are given a set of F={f1,f2,…fN} product feeds
(usually in XML format).
• Each feed fi originates from an electronic store ei
and contains product records.
• Each product record p may contain multiple fields
(title, description, price, brand, category, etc).
• A product cannot appear more than once in the
same feed.
• But it may appear in multiple feeds.
The problem (2)
• A product may be described differently in these
feeds (i.e. it appears under different titles).
• E.g. “Apple iPhone 7” and “iPhone 7” are different
titles which refer to the same product.
• The problem: Match the product titles and
identify if they describe the same product.
• Useful for:
– Price comparison applications & platforms.
– Reviews merging & aggregation.
– Users who desire to compare characteristics & prices.
Similarity/Distance Metrics
• “Apple iPhone 7” and “iPhone 7” are different
titles which refer to the same product.
– Even though a whole word is missing from the second
title (small similarity/distance).
• “Apple iPhone 7” and “Apple iPhone 6” are titles
which DO NOT refer to the same product.
– Even though they only differ by a single character
(higher similarity/distance).
• Similarity/Distance metrics (cosine, Jaccard, edit
distance, etc.) do not work well in this problem.
Supervised Clustering
• For the same reason, the supervised machine
learning clustering approaches (kNN, naïve Bayes,
linear/logistic regression) also do not work well.
– Smaller distances/higher probabilities should not
necessarily be clustered to the same entity.
– Higher distances/smaller probabilities should not
necessarily be clustered in different entities.
State-of-the-art (1)
• V. Gopalakrishnan, SP. Iyengar, A. Madaan, R.
Rastogi, S. Sengamedu. Matching product titles
using web-based enrichment. In Proceedings of
the 21st ACM international conference on
Information and knowledge management, pp.
605-614, 2012.
• N. Londhe, V. Gopalakrishnan, A. Zhang, HQ Ngo,
R. Srihari. Matching titles with cross title web-
search enrichment and community detection. In
Proceedings of the VLDB Endowment, pp. 1167-
1178, 2014.
State-of-the-art (2)
• These approaches are similar:
– They enrich each product title by injecting several
missing words.
– They treat each word in the products’ titles differently,
i.e. each word is assigned an importance score.
– After these two preprocessing phases, they apply the
cosine similarity measure (with an over simplistic
blocking method).
– They create clusters which consist of the same
products.
State-of-the-art - Disadvantages
• One query submitted to a SE per product:
– this approach is infeasible for large-scale datasets.
• In their experiments they use only 2 feeds.
– Most platforms include thousands of electronic stores
(i.e. product feeds).
• They employ the cosine similarity metric.
– which does not perform well in this problem.
Our approach is…
• Standalone: It does not rely on external data
sources (i.e. Web search engines, Web sites, ect).
• Unsupervised: No requirement to manually train
a classifier, or split the dataset in training and
testing data subsets.
• Efficient: Faster than the adversary approach; it
makes use of in-memory data structures.
• Flexible: It facilitates product classification into
multiple clusters.
Overview (1)
• Our proposed method operates in 2 phases:
• Phase 1: construction of two primary data
structures:
– A lexicon which consists of all the k-
combinations of the titles’ words, along with a
frequency value and some statistics.
• Each k-combination is a candidate product cluster.
– A forward index: An array which stores for each
product, a list of pointers to the respective title
k-combinations (we use pointers to avoid saving
the same data twice).
Overview (2)
• Phase 2: We employ these
two data structures to
assign scores to each k-
combination of each
product.
• The k-combinations are
then sorted by decreasing
score value and the highest
scoring combination
represents the cluster.
k-combinations
• k-combinations are combinations of the
words of the product title.
• Length (number of words) = k.
• Without repetition.
• Without care for word ordering.
• We compute the K-combinations of each
product title,
• Number of combinations for a
title which consists of n words:  

 
nkK
k knk
n,
2 !!
!
 6,2K
Phase 1 (1)
Data Structures - Lexicon
• We employ a lexicon structure L to store the
combinations. We also store two statistics:
• A frequency value which represents the number
of documents which contain this combination.
– Frequent combinations are more likely to be declared
cluster labels.
• A distance value which stores the average
distance of the combination from the beginning
of the titles.
– The most important terms in a product description
appear early in the titles.
Data Structures – Forward Index
• We also employ a forward index I which for each
product p, stores a pointer to each combination.
• We assign a score value to each combination in I.
Distance
• Some frequent terms in the titles have no
informational value (i.e. they do not describe the
product, but they contain offers, specs, etc).
– E.g. many products have in their titles the terms “EU”, “OEM”,
“Retail”, etc.
– Therefore, in some cases we get wrong cluster labels, e.g.
“Apple iPhone EU”.
– Similar problems can also be caused by other words: colors
(black, white, red, etc), sizes (large, small, etc) and others.
• Key observation: These terms usually appear
late in the title (i.e. in high position).
Phase 1 (2)
Permutations (3)
• In case a combination is not found in the lexicon,
we compute all its permutations.
• We search for each permutation in the lexicon.
• In case it is found, we increase the frequency of
the corresponding combination and we stop
searching.
• In case it is not found, we do not insert it
• We shall insert the corresponding combination
instead, after all the permutations have been
examined.
Phase 1 (3)
Phase 2
• In phase 2 we compute the scores of each k-
combination of each product.
• To achieve this goal we use the forward index.
• We sort the forward list in decreasing score order.
• The first element of the sorted list is the cluster.
An indicative score function
• Score function
where l(c) is the length of the combination/label, N(c)
is the frequency, and d(c,t) is the average distance of
the combination from the beginning of the string.
   
 
 cN
tcda
cl
cS log
,

Results
• We deployed a focused crawler on skroutz.gr and we
collected 16208 products (mobile phones) classified
in 922 clusters.
• Vendors: 320
• Average number of words in a title: 9
• We consider the classification of skroutz.gr as the
ground truth and we compare the effectiveness of
our algorithm (UMaP) against this.
Effectiveness – F1 measure
Efficiency

More Related Content

What's hot

Lecture 03 data abstraction and er model
Lecture 03 data abstraction and er modelLecture 03 data abstraction and er model
Lecture 03 data abstraction and er model
emailharmeet
 
Logical database design and the relational model(database)
Logical database design and the relational model(database)Logical database design and the relational model(database)
Logical database design and the relational model(database)
welcometofacebook
 
Database management systems 3 - Data Modelling
Database management systems 3 - Data ModellingDatabase management systems 3 - Data Modelling
Database management systems 3 - Data Modelling
Nickkisha Farrell
 
Ch 5 O O Data Modeling
Ch 5  O O  Data ModelingCh 5  O O  Data Modeling
Ch 5 O O Data Modeling
guest8fdbdd
 
Entity relationship diagram (erd)
Entity relationship diagram (erd)Entity relationship diagram (erd)
Entity relationship diagram (erd)
tameemyousaf
 
E-R diagram in Database
E-R diagram in DatabaseE-R diagram in Database
E-R diagram in Database
Fatiha Qureshi
 
Ch 5 O O Data Modeling Class
Ch 5  O O  Data Modeling ClassCh 5  O O  Data Modeling Class
Ch 5 O O Data Modeling Class
guest8fdbdd
 

What's hot (19)

Erd1
Erd1Erd1
Erd1
 
Lecture 03 data abstraction and er model
Lecture 03 data abstraction and er modelLecture 03 data abstraction and er model
Lecture 03 data abstraction and er model
 
Logical database design and the relational model(database)
Logical database design and the relational model(database)Logical database design and the relational model(database)
Logical database design and the relational model(database)
 
Chapter 8
Chapter 8Chapter 8
Chapter 8
 
Relational Database & Database Management System
Relational Database & Database Management SystemRelational Database & Database Management System
Relational Database & Database Management System
 
Datacleaning.ppt
Datacleaning.pptDatacleaning.ppt
Datacleaning.ppt
 
Data modeling
Data modelingData modeling
Data modeling
 
Database management systems 3 - Data Modelling
Database management systems 3 - Data ModellingDatabase management systems 3 - Data Modelling
Database management systems 3 - Data Modelling
 
Ch 5 O O Data Modeling
Ch 5  O O  Data ModelingCh 5  O O  Data Modeling
Ch 5 O O Data Modeling
 
Entity relationship diagram (erd)
Entity relationship diagram (erd)Entity relationship diagram (erd)
Entity relationship diagram (erd)
 
Advantages and disadvantages of er model in DBMS. Types of database models ..
Advantages and disadvantages of er model in DBMS. Types of database models ..Advantages and disadvantages of er model in DBMS. Types of database models ..
Advantages and disadvantages of er model in DBMS. Types of database models ..
 
E-R diagram in Database
E-R diagram in DatabaseE-R diagram in Database
E-R diagram in Database
 
Relational database
Relational databaseRelational database
Relational database
 
ER DIAGRAM & ER MODELING IN DBMS
ER DIAGRAM & ER MODELING IN DBMSER DIAGRAM & ER MODELING IN DBMS
ER DIAGRAM & ER MODELING IN DBMS
 
ER Modeling and Introduction to RDBMS
ER Modeling and Introduction to RDBMSER Modeling and Introduction to RDBMS
ER Modeling and Introduction to RDBMS
 
Ch 5 O O Data Modeling Class
Ch 5  O O  Data Modeling ClassCh 5  O O  Data Modeling Class
Ch 5 O O Data Modeling Class
 
Er Modeling
Er ModelingEr Modeling
Er Modeling
 
Pg. 03 question two 1introduction to databasesi
Pg. 03 question two 1introduction to databasesiPg. 03 question two 1introduction to databasesi
Pg. 03 question two 1introduction to databasesi
 
Erd chapter 3
Erd chapter 3Erd chapter 3
Erd chapter 3
 

Similar to Effective Unsupervised Matching of Product Titles

Compare and Contrast Essay AssignmentLength At least three fu.docx
Compare and Contrast Essay AssignmentLength At least three fu.docxCompare and Contrast Essay AssignmentLength At least three fu.docx
Compare and Contrast Essay AssignmentLength At least three fu.docx
monicafrancis71118
 
Tag And Tag Based Recommender
Tag And Tag Based RecommenderTag And Tag Based Recommender
Tag And Tag Based Recommender
gu wendong
 

Similar to Effective Unsupervised Matching of Product Titles (20)

Table Retrieval and Generation
Table Retrieval and GenerationTable Retrieval and Generation
Table Retrieval and Generation
 
Apex code (Salesforce)
Apex code (Salesforce)Apex code (Salesforce)
Apex code (Salesforce)
 
Text based search engine on a fixed corpus and utilizing indexation and ranki...
Text based search engine on a fixed corpus and utilizing indexation and ranki...Text based search engine on a fixed corpus and utilizing indexation and ranki...
Text based search engine on a fixed corpus and utilizing indexation and ranki...
 
Explain Yourself: Why You Get the Recommendations You Do
Explain Yourself: Why You Get the Recommendations You DoExplain Yourself: Why You Get the Recommendations You Do
Explain Yourself: Why You Get the Recommendations You Do
 
Weka
WekaWeka
Weka
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
 
Reference Scope Identification of Citances Using Convolutional Neural Network
Reference Scope Identification of Citances Using Convolutional Neural NetworkReference Scope Identification of Citances Using Convolutional Neural Network
Reference Scope Identification of Citances Using Convolutional Neural Network
 
PyCon 2013 - Experiments in data mining, entity disambiguation and how to thi...
PyCon 2013 - Experiments in data mining, entity disambiguation and how to thi...PyCon 2013 - Experiments in data mining, entity disambiguation and how to thi...
PyCon 2013 - Experiments in data mining, entity disambiguation and how to thi...
 
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...
 
10 Years of Multi-Label Learning
10 Years of Multi-Label Learning10 Years of Multi-Label Learning
10 Years of Multi-Label Learning
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online News
 
Demystifying NLP Transformers: Understanding the Power and Architecture behin...
Demystifying NLP Transformers: Understanding the Power and Architecture behin...Demystifying NLP Transformers: Understanding the Power and Architecture behin...
Demystifying NLP Transformers: Understanding the Power and Architecture behin...
 
New c sharp3_features_(linq)_part_v
New c sharp3_features_(linq)_part_vNew c sharp3_features_(linq)_part_v
New c sharp3_features_(linq)_part_v
 
Compare and Contrast Essay AssignmentLength At least three fu.docx
Compare and Contrast Essay AssignmentLength At least three fu.docxCompare and Contrast Essay AssignmentLength At least three fu.docx
Compare and Contrast Essay AssignmentLength At least three fu.docx
 
APPLIED PRODUCTIVITY TOOLS WITH ADVANCED APPLICATION TECHNIQUES CUSTOM ANIMAT...
APPLIED PRODUCTIVITY TOOLS WITH ADVANCED APPLICATION TECHNIQUES CUSTOM ANIMAT...APPLIED PRODUCTIVITY TOOLS WITH ADVANCED APPLICATION TECHNIQUES CUSTOM ANIMAT...
APPLIED PRODUCTIVITY TOOLS WITH ADVANCED APPLICATION TECHNIQUES CUSTOM ANIMAT...
 
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
 
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
 
Memory models in c#
Memory models in c#Memory models in c#
Memory models in c#
 
Dataanalysis
DataanalysisDataanalysis
Dataanalysis
 
Tag And Tag Based Recommender
Tag And Tag Based RecommenderTag And Tag Based Recommender
Tag And Tag Based Recommender
 

More from Leonidas Akritidis

More from Leonidas Akritidis (8)

An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank AggregationAn Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
 
A Self-Pruning Classification Model for News
A Self-Pruning Classification Model for NewsA Self-Pruning Classification Model for News
A Self-Pruning Classification Model for News
 
Effective Products Categorization with Importance Scores and Morphological An...
Effective Products Categorization with ImportanceScores and Morphological An...Effective Products Categorization with ImportanceScores and Morphological An...
Effective Products Categorization with Importance Scores and Morphological An...
 
Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...
Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...
Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...
 
A Supervised Machine Learning Algorithm for Research Articles
A Supervised Machine Learning Algorithm for Research ArticlesA Supervised Machine Learning Algorithm for Research Articles
A Supervised Machine Learning Algorithm for Research Articles
 
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceComputing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
 
Positional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted IndexesPositional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted Indexes
 
Identifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does MatterIdentifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does Matter
 

Recently uploaded

electrical installation and maintenance.
electrical installation and maintenance.electrical installation and maintenance.
electrical installation and maintenance.
benjamincojr
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
drjose256
 
21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx
rahulmanepalli02
 

Recently uploaded (20)

21scheme vtu syllabus of visveraya technological university
21scheme vtu syllabus of visveraya technological university21scheme vtu syllabus of visveraya technological university
21scheme vtu syllabus of visveraya technological university
 
What is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsWhat is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, Functions
 
electrical installation and maintenance.
electrical installation and maintenance.electrical installation and maintenance.
electrical installation and maintenance.
 
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...
 
Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Tools
Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and ToolsMaximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Tools
Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Tools
 
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfInstruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
 
Circuit Breakers for Engineering Students
Circuit Breakers for Engineering StudentsCircuit Breakers for Engineering Students
Circuit Breakers for Engineering Students
 
Software Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfSoftware Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdf
 
Diploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdfDiploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdf
 
History of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & ModernizationHistory of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & Modernization
 
Independent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging StationIndependent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging Station
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
 
The Entity-Relationship Model(ER Diagram).pptx
The Entity-Relationship Model(ER Diagram).pptxThe Entity-Relationship Model(ER Diagram).pptx
The Entity-Relationship Model(ER Diagram).pptx
 
Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)
 
21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx
 
Developing a smart system for infant incubators using the internet of things ...
Developing a smart system for infant incubators using the internet of things ...Developing a smart system for infant incubators using the internet of things ...
Developing a smart system for infant incubators using the internet of things ...
 
Fuzzy logic method-based stress detector with blood pressure and body tempera...
Fuzzy logic method-based stress detector with blood pressure and body tempera...Fuzzy logic method-based stress detector with blood pressure and body tempera...
Fuzzy logic method-based stress detector with blood pressure and body tempera...
 
analog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxanalog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptx
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
 

Effective Unsupervised Matching of Product Titles

  • 1. Effective Unsupervised Matching of Product Titles with k-Combinations and Permutations Leonidas Akritidis, Panayiotis Bozanis Department of Electrical and Computer Engineering University of Thessaly, Greece L. Akritidis, P. Bozanis 1IEEE INISTA 2018
  • 2. The problem (1) • We are given a set of F={f1,f2,…fN} product feeds (usually in XML format). • Each feed fi originates from an electronic store ei and contains product records. • Each product record p may contain multiple fields (title, description, price, brand, category, etc). • A product cannot appear more than once in the same feed. • But it may appear in multiple feeds.
  • 3. The problem (2) • A product may be described differently in these feeds (i.e. it appears under different titles). • E.g. “Apple iPhone 7” and “iPhone 7” are different titles which refer to the same product. • The problem: Match the product titles and identify if they describe the same product. • Useful for: – Price comparison applications & platforms. – Reviews merging & aggregation. – Users who desire to compare characteristics & prices.
  • 4. Similarity/Distance Metrics • “Apple iPhone 7” and “iPhone 7” are different titles which refer to the same product. – Even though a whole word is missing from the second title (small similarity/distance). • “Apple iPhone 7” and “Apple iPhone 6” are titles which DO NOT refer to the same product. – Even though they only differ by a single character (higher similarity/distance). • Similarity/Distance metrics (cosine, Jaccard, edit distance, etc.) do not work well in this problem.
  • 5. Supervised Clustering • For the same reason, the supervised machine learning clustering approaches (kNN, naïve Bayes, linear/logistic regression) also do not work well. – Smaller distances/higher probabilities should not necessarily be clustered to the same entity. – Higher distances/smaller probabilities should not necessarily be clustered in different entities.
  • 6. State-of-the-art (1) • V. Gopalakrishnan, SP. Iyengar, A. Madaan, R. Rastogi, S. Sengamedu. Matching product titles using web-based enrichment. In Proceedings of the 21st ACM international conference on Information and knowledge management, pp. 605-614, 2012. • N. Londhe, V. Gopalakrishnan, A. Zhang, HQ Ngo, R. Srihari. Matching titles with cross title web- search enrichment and community detection. In Proceedings of the VLDB Endowment, pp. 1167- 1178, 2014.
  • 7. State-of-the-art (2) • These approaches are similar: – They enrich each product title by injecting several missing words. – They treat each word in the products’ titles differently, i.e. each word is assigned an importance score. – After these two preprocessing phases, they apply the cosine similarity measure (with an over simplistic blocking method). – They create clusters which consist of the same products.
  • 8. State-of-the-art - Disadvantages • One query submitted to a SE per product: – this approach is infeasible for large-scale datasets. • In their experiments they use only 2 feeds. – Most platforms include thousands of electronic stores (i.e. product feeds). • They employ the cosine similarity metric. – which does not perform well in this problem.
  • 9. Our approach is… • Standalone: It does not rely on external data sources (i.e. Web search engines, Web sites, ect). • Unsupervised: No requirement to manually train a classifier, or split the dataset in training and testing data subsets. • Efficient: Faster than the adversary approach; it makes use of in-memory data structures. • Flexible: It facilitates product classification into multiple clusters.
  • 10. Overview (1) • Our proposed method operates in 2 phases: • Phase 1: construction of two primary data structures: – A lexicon which consists of all the k- combinations of the titles’ words, along with a frequency value and some statistics. • Each k-combination is a candidate product cluster. – A forward index: An array which stores for each product, a list of pointers to the respective title k-combinations (we use pointers to avoid saving the same data twice).
  • 11. Overview (2) • Phase 2: We employ these two data structures to assign scores to each k- combination of each product. • The k-combinations are then sorted by decreasing score value and the highest scoring combination represents the cluster.
  • 12. k-combinations • k-combinations are combinations of the words of the product title. • Length (number of words) = k. • Without repetition. • Without care for word ordering. • We compute the K-combinations of each product title, • Number of combinations for a title which consists of n words:      nkK k knk n, 2 !! !  6,2K
  • 14. Data Structures - Lexicon • We employ a lexicon structure L to store the combinations. We also store two statistics: • A frequency value which represents the number of documents which contain this combination. – Frequent combinations are more likely to be declared cluster labels. • A distance value which stores the average distance of the combination from the beginning of the titles. – The most important terms in a product description appear early in the titles.
  • 15. Data Structures – Forward Index • We also employ a forward index I which for each product p, stores a pointer to each combination. • We assign a score value to each combination in I.
  • 16. Distance • Some frequent terms in the titles have no informational value (i.e. they do not describe the product, but they contain offers, specs, etc). – E.g. many products have in their titles the terms “EU”, “OEM”, “Retail”, etc. – Therefore, in some cases we get wrong cluster labels, e.g. “Apple iPhone EU”. – Similar problems can also be caused by other words: colors (black, white, red, etc), sizes (large, small, etc) and others. • Key observation: These terms usually appear late in the title (i.e. in high position).
  • 18. Permutations (3) • In case a combination is not found in the lexicon, we compute all its permutations. • We search for each permutation in the lexicon. • In case it is found, we increase the frequency of the corresponding combination and we stop searching. • In case it is not found, we do not insert it • We shall insert the corresponding combination instead, after all the permutations have been examined.
  • 20. Phase 2 • In phase 2 we compute the scores of each k- combination of each product. • To achieve this goal we use the forward index. • We sort the forward list in decreasing score order. • The first element of the sorted list is the cluster.
  • 21. An indicative score function • Score function where l(c) is the length of the combination/label, N(c) is the frequency, and d(c,t) is the average distance of the combination from the beginning of the string.        cN tcda cl cS log , 
  • 22. Results • We deployed a focused crawler on skroutz.gr and we collected 16208 products (mobile phones) classified in 922 clusters. • Vendors: 320 • Average number of words in a title: 9 • We consider the classification of skroutz.gr as the ground truth and we compare the effectiveness of our algorithm (UMaP) against this.