Effective Products Categorization with Importance
Scores and Morphological Analysis of the Titles
Leonidas Akritidis, Athanasios Fevgas, Panayiotis Bozanis
Department of Electrical and Computer Engineering
Data Structuring and Engineering Lab
University of Thessaly
The 30th IEEE International Conference on Tools
with Artificial Intelligence (ICTAI 2018)
November 5-7, 2018, Volos, Greece
E-commerce
• Large growth rate:
– E-commerce share of retail sales worldwide (2015): 7.4%.
– E-commerce share of retail sales worldwide (2018): 11.9%.
– Predicted E-commerce share of retail sales worldwide
(2021): 17.5% [1].
• The related research problems have become
increasingly important.
• Effective and efficient management, processing, and
mining of product data are examples of such
problems.
[1] https://www.statista.com/statistics/534123/e-commerce-share-of-retailsales-worldwide/
Products Categorization
• One of the most important problems in the area.
• Given a product and a set of categories, the task is
to determine the category to which the product
belongs.
• It enables numerous applications:
– Query expansion and rewriting.
– Retrieval of relevant/similar products.
– Personalized recommendations, etc.
Attributes or not?
• Related work falls into two categories:
– methods that are based on the product titles only, and
– methods that also take into consideration additional
properties of the product (brand name, attributes,
technical characteristics, etc.).
• However, such metadata is not always present;
even if it is present, it is frequently incomplete,
ambiguous, inconsistent, or incorrect.
• The proposed method belongs to the first
category and operates by accessing the titles only.
Theoretical Background
• Let Y be the set of all categories.
• The categories are usually organized into a tree
structure with parent and leaf categories.
• A product can be assigned to only one leaf category
in the aforementioned tree.
• Each product is described by its title τ.
• The title consists of the words (w_1, w_2, …, w_{l_x}).
N-grams vs. words
• The proposed method performs morphological
analysis of the titles and extracts n-grams of
variable sizes.
• The reason for employing n-grams is that an n-gram
is less ambiguous than a single word.
• For instance, a brand name (e.g. Apple) may be
correlated with multiple diverse categories
(mobile phones, tablets, computers, etc.).
• However, the bigram Apple iPhone is much more
specific.
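As a rough illustration of extracting n-grams of variable sizes, the sketch below lowercases a title, splits it on whitespace, and emits every word n-gram up to a maximum size. The function name extract_tokens, the parameter max_n, and the lowercase-and-split normalization are assumptions made for this example; the morphological analysis of the proposed method may involve more than this.

```python
def extract_tokens(title, max_n=3):
    """Return all word n-grams of sizes 1..max_n found in a product title.

    Normalization here (lowercasing, whitespace split) is an assumption
    made for illustration only.
    """
    words = title.lower().split()
    tokens = []
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive words over the title.
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


tokens = extract_tokens("Apple iPhone 8 64GB", max_n=2)
print(tokens)
```

With max_n=2 the result contains both the ambiguous unigram "apple" and the more specific bigram "apple iphone".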
Tokens and Ambiguity
• We collectively refer to all n-grams and words as
tokens.
• Each token has its own level of ambiguity.
– Ambiguous tokens are not tightly correlated with a
single category; they can be connected with multiple
categories.
• Equivalently, each token has a different importance.
• As the previous example shows, longer tokens
are less ambiguous, that is, more important.
Importance Scores
• Furthermore, a token is more important if it has
been correlated with only a few categories.
• Based on these observations, we introduce an
importance score I_t for a token t, which depends on:
– |Y|: the total number of categories,
– f_t: the frequency of t, i.e. the number of categories
that have been correlated with t, and
– l_t: the length (in words) of t.
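To make the roles of the three quantities concrete, the sketch below uses an assumed, IDF-style form: the token length multiplied by the logarithm of |Y| divided by the token frequency. This particular expression is an assumption chosen only because it grows with the length of the token and shrinks as the token is correlated with more categories; it should not be read as the exact formula of the proposed method. The function and parameter names are likewise hypothetical.

```python
import math


def importance(num_categories, f_t, l_t):
    """ASSUMED importance score: l_t * log(|Y| / f_t).

    num_categories: |Y|, the total number of categories
    f_t: number of categories correlated with the token (f_t >= 1)
    l_t: length of the token in words
    """
    return l_t * math.log(num_categories / f_t)


# A unigram seen in many categories versus a bigram seen in few.
broad_unigram = importance(num_categories=230, f_t=40, l_t=1)
specific_bigram = importance(num_categories=230, f_t=2, l_t=2)
print(broad_unigram < specific_bigram)
```

Under this assumed form, a token that appears in every category receives a score of zero, and longer, rarer tokens receive higher scores.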
Categorization Model: Training Phase
• The training phase builds a lexicon L for the tokens.
Each entry in the lexicon stores:
– the token t,
– its frequency f_t,
– its importance score I_t,
– a relevance description vector (RDV) which, for each
token-category relationship, includes a pair of the
form (y, f_{t,y}), where y is a category and f_{t,y} is another
frequency value that reflects how many times t has
been correlated with the category y.
Model Training Algorithm
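A minimal sketch of how such a lexicon could be built from labeled titles is given below. It is an illustration under stated assumptions, not the training algorithm of the proposed method: the helper extract_tokens, the dictionary layout of an entry, the choice to count a token at most once per product, and the IDF-style importance formula are all hypothetical.

```python
import math
from collections import defaultdict


def extract_tokens(title, max_n=3):
    # Assumed tokenization: lowercase, whitespace split, word n-grams 1..max_n.
    words = title.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def train(products, max_n=3):
    """Build a lexicon from a list of (title, category) pairs."""
    num_categories = len({category for _, category in products})  # |Y|
    rdv = defaultdict(lambda: defaultdict(int))
    for title, category in products:
        # Assumption: a token counts once per product, even if repeated.
        for token in set(extract_tokens(title, max_n)):
            rdv[token][category] += 1  # f_{t,y}

    lexicon = {}
    for token, counts in rdv.items():
        f_t = len(counts)               # categories correlated with t
        l_t = len(token.split())        # token length in words
        lexicon[token] = {
            "f_t": f_t,
            # ASSUMED importance formula, for illustration only.
            "importance": l_t * math.log(num_categories / f_t),
            "rdv": dict(counts),
        }
    return lexicon


products = [
    ("Apple iPhone 8", "Phones"),
    ("Apple iPad Pro", "Tablets"),
    ("Apple iPhone X", "Phones"),
]
lexicon = train(products, max_n=2)
print(lexicon["apple"])
print(lexicon["apple iphone"])
```

In this toy run the unigram "apple" is correlated with both categories, while the bigram "apple iphone" is correlated with one category only and therefore receives the higher score.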
Categorization Model: Testing Phase
• The testing phase is based on the lexicon L of the
previous phase.
• Initially, an empty candidates list Y′ is created.
• Then, for each token t of each product p,
we perform a search in L.
• If the search succeeds, we retrieve the RDV, f_t, and
I_t of t. Then, we traverse the RDV and, for each
entry y, we update the candidates list Y′.
Candidates Scoring
• The score S_{t,y} of each candidate category is
expressed as a linear combination of the
importance score of the token and a quantity Q_{t,y}.
• Finally, the candidates are sorted in decreasing
order of S_{t,y} and the top candidate is selected as the
category of the product.
Categorization Algorithm
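The lookup-and-score loop can be sketched as follows. Several details are assumptions made only so that the example runs: Q_{t,y} is taken to be the share of the token's RDV counts that fall in category y, the weights a and b of the linear combination are hypothetical parameters, the scores of different tokens are summed per category, and the hand-written lexicon uses made-up importance values.

```python
def extract_tokens(title, max_n=2):
    # Assumed tokenization: lowercase, whitespace split, word n-grams 1..max_n.
    words = title.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def categorize(title, lexicon, max_n=2, a=1.0, b=1.0):
    """Return candidate categories sorted by decreasing accumulated score."""
    candidates = {}  # the candidates list Y'
    for token in extract_tokens(title, max_n):
        entry = lexicon.get(token)
        if entry is None:
            continue  # token not in the lexicon: unsuccessful search
        total = sum(entry["rdv"].values())
        for y, f_ty in entry["rdv"].items():
            q = f_ty / total                      # ASSUMED form of Q_{t,y}
            score = a * entry["importance"] + b * q
            # Assumption: scores of different tokens add up per category.
            candidates[y] = candidates.get(y, 0.0) + score
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)


# Hand-written toy lexicon with illustrative importance values.
lexicon = {
    "apple": {"importance": 0.0, "rdv": {"Phones": 2, "Tablets": 1}},
    "apple iphone": {"importance": 1.4, "rdv": {"Phones": 2}},
    "apple ipad": {"importance": 1.4, "rdv": {"Tablets": 1}},
}

ranked = categorize("Apple iPhone 11", lexicon)
print(ranked[0][0] if ranked else None)
```

A title with no known tokens yields an empty ranking, so the caller has to decide how to handle products that cannot be categorized.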
Model Pruning
• We may decrease the size of the lexicon L by
preserving only the tokens whose importance
score exceeds a threshold C.
• We will demonstrate experimentally that this
choice combines a significant reduction of the
size of L with small losses in the accuracy
of the algorithm.
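Pruning of this kind amounts to a single filtering pass over the lexicon, as in the sketch below. The entry layout, the function name prune, and the example importance values are hypothetical.

```python
def prune(lexicon, threshold):
    """Keep only the entries whose importance score exceeds the threshold."""
    return {token: entry
            for token, entry in lexicon.items()
            if entry["importance"] > threshold}


# Toy lexicon with made-up importance values.
lexicon = {
    "apple": {"importance": 0.0, "rdv": {"Phones": 2, "Tablets": 1}},
    "apple iphone": {"importance": 1.4, "rdv": {"Phones": 2}},
    "iphone": {"importance": 0.7, "rdv": {"Phones": 2}},
}

pruned = prune(lexicon, threshold=0.5)
print(sorted(pruned))
```

Because the ambiguous tokens are the ones with low scores, they are the first to be dropped, which is consistent with the small accuracy loss reported for the pruned model.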
Experimental Setup
• Dataset: 313,706 products & 230 (191 leaf)
categories from shopmania.com.
• Training/Test Set sizes: 60%/40%.
• Four scenarios for the extracted n-grams:
– N=1: Extract unigrams (single words) only.
– N=2: Extract unigrams & bigrams.
– N=3: Extract unigrams, bigrams, & trigrams.
– N=4: Extract 1-, 2-, 3-, and 4-grams.
• Pruning parameter T: 0.0 ≤ T < 1.0 in steps of 0.1.
Examined Methods
• SPC – Supervised Products Classifier – the
proposed algorithm.
• LogReg – Logistic Regression.
• RanFor – Random Forests.
Accuracy Evaluation
• Highest accuracy: 95.1% for N=3, T=0.
• Interesting measurement: accuracy 93% for N=3, T=0.7.
Pruned Model Evaluation
• For N=3, T=0.7: accuracy 93% with 47% fewer tokens
and 54% fewer RDV entries.
Conclusions
• We presented a supervised learning algorithm for
products categorization.
• It trains a classification model based on:
– The morphological analysis of the titles,
– The extraction of n-grams of variable sizes,
– The assignment of importance scores to each token.
• The method achieves ~95% classification accuracy.
• It also embodies a self-pruning strategy.
• The experiments have demonstrated that this
strategy leads to a reduction of about 50% in the size
of the model combined with small losses in the
classifier performance.
Thank You
Any Questions?
L. Akritidis, A. Fevgas, P. Tosmpanopoulou P. Bozanis 20IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece

More Related Content

Similar to Effective Products Categorization with Importance Scores and Morphological Analysis of the Titles

Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...
Aliaksandr Birukou
 
Mastering an ontology & vocabulary management technology in France ?
Mastering an ontology & vocabulary management technology in France ?Mastering an ontology & vocabulary management technology in France ?
Mastering an ontology & vocabulary management technology in France ?
INRAE (MISTEA) and University of Montpellier (LIRMM)
 
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019
webwinkelvakdag
 
Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...
Anastasija Nikiforova
 
Kenett On Information NYU-Poly 2013
Kenett On Information NYU-Poly 2013Kenett On Information NYU-Poly 2013
Kenett On Information NYU-Poly 2013
The Hebrew University of Jerusalem
 
CASE Network Studies and Analyses 393 - Are Unit Export Values Correct Maasur...
CASE Network Studies and Analyses 393 - Are Unit Export Values Correct Maasur...CASE Network Studies and Analyses 393 - Are Unit Export Values Correct Maasur...
CASE Network Studies and Analyses 393 - Are Unit Export Values Correct Maasur...
CASE Center for Social and Economic Research
 
Interannotator Agreement
Interannotator AgreementInterannotator Agreement
Interannotator Agreement
john6938
 
Preservation Planning using Plato, by Hannes Kulovits and Andreas Rauber
Preservation Planning using Plato, by Hannes Kulovits and Andreas RauberPreservation Planning using Plato, by Hannes Kulovits and Andreas Rauber
Preservation Planning using Plato, by Hannes Kulovits and Andreas Rauber
JISC KeepIt project
 
Evalita2018 iListen - itaLIan Speech acT labEliNg
Evalita2018 iListen - itaLIan Speech acT labEliNgEvalita2018 iListen - itaLIan Speech acT labEliNg
Evalita2018 iListen - itaLIan Speech acT labEliNg
Nicole Novielli
 
Metadata catalogues survey results, EOSCpilot H2020 EU project
Metadata catalogues survey results, EOSCpilot H2020 EU projectMetadata catalogues survey results, EOSCpilot H2020 EU project
Metadata catalogues survey results, EOSCpilot H2020 EU project
Massimiliano Assante
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
An introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalAn introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information Retrieval
Mounia Lalmas-Roelleke
 
Příklad bibliometrické zprávy
Příklad bibliometrické zprávyPříklad bibliometrické zprávy
Příklad bibliometrické zprávy
MEYS, MŠMT in Czech
 
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Marina Santini
 
Evidence-based Semantic Web Just a Dream or the Way to Go?
Evidence-based Semantic WebJust a Dream or the Way to Go?Evidence-based Semantic WebJust a Dream or the Way to Go?
Evidence-based Semantic Web Just a Dream or the Way to Go?
Dragan Gasevic
 
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
Dr. Haxel Consult
 
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data AnalysisWorkshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Olga Scrivner
 
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerceOff-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
Ladislav Peska
 
Report on the First Knowledge Graph Reasoning Challenge 2018 -Toward the eXp...
Report on the First Knowledge Graph Reasoning Challenge  2018 -Toward the eXp...Report on the First Knowledge Graph Reasoning Challenge  2018 -Toward the eXp...
Report on the First Knowledge Graph Reasoning Challenge 2018 -Toward the eXp...
KnowledgeGraph
 
Oeb08 Dec08 Tyamada
Oeb08 Dec08 TyamadaOeb08 Dec08 Tyamada
Oeb08 Dec08 Tyamada
tsyamada
 

Similar to Effective Products Categorization with Importance Scores and Morphological Analysis of the Titles (20)

Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...
 
Mastering an ontology & vocabulary management technology in France ?
Mastering an ontology & vocabulary management technology in France ?Mastering an ontology & vocabulary management technology in France ?
Mastering an ontology & vocabulary management technology in France ?
 
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019
 
Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...
 
Kenett On Information NYU-Poly 2013
Kenett On Information NYU-Poly 2013Kenett On Information NYU-Poly 2013
Kenett On Information NYU-Poly 2013
 
CASE Network Studies and Analyses 393 - Are Unit Export Values Correct Maasur...
CASE Network Studies and Analyses 393 - Are Unit Export Values Correct Maasur...CASE Network Studies and Analyses 393 - Are Unit Export Values Correct Maasur...
CASE Network Studies and Analyses 393 - Are Unit Export Values Correct Maasur...
 
Interannotator Agreement
Interannotator AgreementInterannotator Agreement
Interannotator Agreement
 
Preservation Planning using Plato, by Hannes Kulovits and Andreas Rauber
Preservation Planning using Plato, by Hannes Kulovits and Andreas RauberPreservation Planning using Plato, by Hannes Kulovits and Andreas Rauber
Preservation Planning using Plato, by Hannes Kulovits and Andreas Rauber
 
Evalita2018 iListen - itaLIan Speech acT labEliNg
Evalita2018 iListen - itaLIan Speech acT labEliNgEvalita2018 iListen - itaLIan Speech acT labEliNg
Evalita2018 iListen - itaLIan Speech acT labEliNg
 
Metadata catalogues survey results, EOSCpilot H2020 EU project
Metadata catalogues survey results, EOSCpilot H2020 EU projectMetadata catalogues survey results, EOSCpilot H2020 EU project
Metadata catalogues survey results, EOSCpilot H2020 EU project
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
An introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalAn introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information Retrieval
 
Příklad bibliometrické zprávy
Příklad bibliometrické zprávyPříklad bibliometrické zprávy
Příklad bibliometrické zprávy
 
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
 
Evidence-based Semantic Web Just a Dream or the Way to Go?
Evidence-based Semantic WebJust a Dream or the Way to Go?Evidence-based Semantic WebJust a Dream or the Way to Go?
Evidence-based Semantic Web Just a Dream or the Way to Go?
 
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
 
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data AnalysisWorkshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
 
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerceOff-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
 
Report on the First Knowledge Graph Reasoning Challenge 2018 -Toward the eXp...
Report on the First Knowledge Graph Reasoning Challenge  2018 -Toward the eXp...Report on the First Knowledge Graph Reasoning Challenge  2018 -Toward the eXp...
Report on the First Knowledge Graph Reasoning Challenge 2018 -Toward the eXp...
 
Oeb08 Dec08 Tyamada
Oeb08 Dec08 TyamadaOeb08 Dec08 Tyamada
Oeb08 Dec08 Tyamada
 

More from Leonidas Akritidis

An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank AggregationAn Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
Leonidas Akritidis
 
Effective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product TitlesEffective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product Titles
Leonidas Akritidis
 
A Supervised Machine Learning Algorithm for Research Articles
A Supervised Machine Learning Algorithm for Research ArticlesA Supervised Machine Learning Algorithm for Research Articles
A Supervised Machine Learning Algorithm for Research Articles
Leonidas Akritidis
 
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceComputing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Leonidas Akritidis
 
Positional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted IndexesPositional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted Indexes
Leonidas Akritidis
 
Identifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does MatterIdentifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does Matter
Leonidas Akritidis
 

More from Leonidas Akritidis (6)

An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank AggregationAn Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
 
Effective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product TitlesEffective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product Titles
 
A Supervised Machine Learning Algorithm for Research Articles
A Supervised Machine Learning Algorithm for Research ArticlesA Supervised Machine Learning Algorithm for Research Articles
A Supervised Machine Learning Algorithm for Research Articles
 
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceComputing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
 
Positional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted IndexesPositional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted Indexes
 
Identifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does MatterIdentifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does Matter
 

Recently uploaded

KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
rpskprasana
 
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
University of Maribor
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
SUTEJAS
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
VICTOR MAESTRE RAMIREZ
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
171ticu
 
A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...
nooriasukmaningtyas
 
Casting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdfCasting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdf
zubairahmad848137
 
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
wisnuprabawa3
 
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball playEric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
enizeyimana36
 
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptxML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
JamalHussainArman
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
kandramariana6
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
mamamaam477
 
Question paper of renewable energy sources
Question paper of renewable energy sourcesQuestion paper of renewable energy sources
Question paper of renewable energy sources
mahammadsalmanmech
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
Madan Karki
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
NidhalKahouli2
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
171ticu
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
bijceesjournal
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 

Recently uploaded (20)

KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
CSM Cloud Service Management Presentarion
CSM Cloud Service Management PresentarionCSM Cloud Service Management Presentarion
CSM Cloud Service Management Presentarion
 
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
 
A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...A review on techniques and modelling methodologies used for checking electrom...
A review on techniques and modelling methodologies used for checking electrom...
 
Casting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdfCasting-Defect-inSlab continuous casting.pdf
Casting-Defect-inSlab continuous casting.pdf
 
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
 
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball playEric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
 
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptxML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
 
Question paper of renewable energy sources
Question paper of renewable energy sourcesQuestion paper of renewable energy sources
Question paper of renewable energy sources
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 

Effective Products Categorization with Importance Scores and Morphological Analysis of the Titles

  • 1. Effective Products Categorization with Importance Scores and Morphological Analysis of the Titles Leonidas Akritidis, Athanasios Fevgas, Panayiotis Bozanis Department of Electrical and Computer Engineering Data Structuring and Engineering Lab University of Thessaly The 30th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2018) November 5-7, 2018, Volos, Greece
  • 2. E-commerce • Large growth rate: – E-commerce share of retail sales worldwide (2015): 7.4%. – E-commerce share of retail sales worldwide (2018): 11.9%. – Predicted E-commerce share of retail sales worldwide (2021): 17.5%1 • The related research problems have been rendered increasingly important. • Effective and efficient management, processing, and mining of products data are examples of such problems. 1https://www.statista.com/statistics/534123/e-commerce-share-of-retailsales-worldwide/ L. Akritidis, A. Fevgas, P. Bozanis 2IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 3. Products Categorization
      • One of the most important problems in the area.
      • Given a product and a set of categories, determine the category to which the product belongs.
      • It leads to numerous novel applications:
          – Query expansion and rewriting.
          – Relevant/similar products retrieval.
          – Personalized recommendations, etc.
  • 4. Attributes or not?
      • Relevant work is divided into two categories:
          – methods that are based on the product titles only, and
          – methods that also take into consideration additional properties of the product (brand name, attributes, technical characteristics, etc.).
      • However, such metadata is not always present; even when it is present, it is frequently incomplete, ambiguous, inconsistent, or incorrect.
      • The proposed method belongs to the first category and operates on the titles only.
  • 5. Theoretical Background
      • Let Y be the set of all categories.
      • The categories are usually organized into a tree structure with parent and leaf categories.
      • A product can be assigned to only one leaf category of this tree.
      • Each product is described by its title τ.
      • The title consists of the words (w_1, w_2, …, w_{l_x}).
  • 6. N-grams vs. words
      • The proposed method performs morphological analysis of the titles and extracts n-grams of variable sizes.
      • The reason for employing n-grams is that an n-gram is less ambiguous than a single word.
      • For instance, a brand name (e.g. Apple) may be correlated with multiple diverse categories (mobile phones, tablets, computers, etc.).
      • However, the bigram "Apple iPhone" is much more specific.
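The n-gram extraction described on slide 6 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_tokens`, the lowercasing, and the whitespace tokenization are assumptions, since the slides do not detail the morphological analysis.

```python
def extract_tokens(title, max_n=3):
    """Return all word n-grams of sizes 1..max_n found in a product title.

    Assumption: the title is lowercased and split on whitespace; the
    slides do not specify the actual preprocessing.
    """
    words = title.lower().split()
    tokens = []
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive words over the title.
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


tokens = extract_tokens("Apple iPhone 8 64GB", max_n=2)
```

With `max_n=2` the result contains the four unigrams and the three bigrams of the title, including the more specific token `"apple iphone"`.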
  • 7. Tokens and Ambiguity
      • We collectively refer to all n-grams and words as tokens.
      • Each token has its own level of ambiguity.
          – Ambiguous tokens are not tightly correlated with a single category; they can be connected with multiple categories.
      • Put differently, each token has a different importance.
      • Following the previous example, longer tokens are less ambiguous, that is, more important.
  • 8. Importance Scores
      • Furthermore, a token is more important if it has been correlated with only a few categories.
      • Based on these observations, we introduce an importance score I_t for a token t that depends on:
          – |Y|: the total number of categories,
          – f_t: the frequency of t, i.e. the number of categories that have been correlated with t, and
          – l_t: the length (in words) of t.
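The formula for the importance score is not preserved in this transcript. The sketch below therefore uses a hypothetical IDF-like form, I_t = l_t · log(|Y| / f_t), chosen only because it has the two properties stated on slides 7 and 8: longer tokens score higher, and tokens correlated with fewer categories score higher. It should not be read as the authors' definition; the function name and parameter names are likewise illustrative.

```python
import math


def importance_score(num_categories, f_t, l_t):
    """Hypothetical importance score for a token t (an assumed form).

    num_categories: |Y|, the total number of categories.
    f_t: number of categories correlated with the token (f_t >= 1).
    l_t: length of the token in words.
    """
    # Grows with the token length and shrinks as f_t approaches |Y|.
    return l_t * math.log(num_categories / f_t)


# A bigram seen in 1 of 230 categories vs. a unigram seen in 50 of them.
specific = importance_score(230, 1, 2)
ambiguous = importance_score(230, 50, 1)
```

Under this assumed form, a token that occurs in every category receives a score of zero, which matches the intuition that it carries no discriminating information.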
  • 9. Categorization Model: Training Phase
      • The training phase builds a lexicon L of the tokens. Each entry in the lexicon stores:
          – the token t,
          – its frequency f_t,
          – its importance score I_t,
          – a relevance description vector (RDV) that, for each token-category relationship, includes a pair of the form (y, f_{t,y}), where y is a category and f_{t,y} is a frequency value that reflects how many times t has been correlated with the category y.
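A lexicon with the entry layout listed on slide 9 could be built along the following lines. This is a sketch under stated assumptions: `extract_tokens`, `train_lexicon`, and the dictionary keys are hypothetical names, each token is counted once per product title, |Y| is taken as the number of distinct categories in the training data, and the importance score is supplied by the caller through `score_fn` because its formula is not preserved in this transcript.

```python
from collections import defaultdict


def extract_tokens(title, max_n=3):
    # Assumed preprocessing: lowercase, whitespace split, n-grams of size 1..max_n.
    words = title.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def train_lexicon(products, score_fn, max_n=3):
    """Build the lexicon from (title, category) pairs.

    score_fn(num_categories, f_t, l_t) is a caller-supplied importance score.
    """
    rdv = defaultdict(lambda: defaultdict(int))  # token -> {category y: f_{t,y}}
    categories = set()
    for title, category in products:
        categories.add(category)
        # Assumption: a token is counted once per title, even if repeated.
        for token in set(extract_tokens(title, max_n)):
            rdv[token][category] += 1

    lexicon = {}
    for token, per_category in rdv.items():
        f_t = len(per_category)          # number of categories correlated with t
        l_t = len(token.split())         # token length in words
        lexicon[token] = {
            "f_t": f_t,
            "I_t": score_fn(len(categories), f_t, l_t),
            "rdv": dict(per_category),
        }
    return lexicon


products = [
    ("Apple iPhone 8", "phones"),
    ("Apple iPad Pro", "tablets"),
    ("Apple iPhone X", "phones"),
]
# Placeholder score for illustration only: longer and rarer tokens score higher.
lexicon = train_lexicon(products, score_fn=lambda Y, f, l: l / f, max_n=2)
```

In this toy data the unigram `"apple"` is correlated with two categories, whereas the bigram `"apple iphone"` is correlated with one, which mirrors the ambiguity argument of slide 6.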
  • 10. Model Training Algorithm
  • 11. Categorization Model: Testing Phase
      • The testing phase is based on the lexicon L built in the training phase.
      • Initially, an empty candidates list Y' is created.
      • Then, for each token t of each product p, we search for t in L.
      • If the search is successful, we retrieve the RDV, f_t, and I_t of t. We then traverse the RDV and, for each entry y, update the candidates list Y'.
  • 12. Candidates Scoring
      • The score S_{t,y} of each candidate category is expressed as a linear combination of the importance score of the token and a quantity Q_{t,y}.
      • Finally, the candidates are sorted in decreasing S_{t,y} order and the top candidate is selected as the category of the product.
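The testing phase and candidate scoring of slides 11 and 12 might be sketched as below. Neither the scoring formula nor the definition of Q_{t,y} is preserved in this transcript, so several parts are assumptions made purely for illustration: Q_{t,y} is taken as the share of the token's occurrences that fall in category y, the linear combination uses hypothetical weights `a` and `b`, and per-token scores are summed per category. The function name `categorize` and the lexicon layout are also hypothetical.

```python
def categorize(title_tokens, lexicon, a=1.0, b=1.0):
    """Return the top-scoring category for a tokenized title, or None.

    lexicon maps token -> {"I_t": float, "rdv": {category: f_{t,y}}}.
    """
    candidates = {}  # the candidates list Y': category -> accumulated score
    for t in title_tokens:
        entry = lexicon.get(t)
        if entry is None:
            continue  # unsuccessful search: the token is not in the lexicon
        total = sum(entry["rdv"].values())
        for y, f_ty in entry["rdv"].items():
            q_ty = f_ty / total                 # assumed form of Q_{t,y}
            s_ty = a * entry["I_t"] + b * q_ty  # assumed linear combination
            candidates[y] = candidates.get(y, 0.0) + s_ty
    # Sort the candidates by decreasing score and pick the top one.
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0] if ranked else None


toy_lexicon = {
    "apple": {"I_t": 0.2, "rdv": {"phones": 2, "tablets": 1}},
    "apple iphone": {"I_t": 1.5, "rdv": {"phones": 2}},
}
predicted = categorize(["apple", "iphone", "apple iphone"], toy_lexicon)
```

In the toy lexicon, the specific bigram contributes far more to the "phones" candidate than the ambiguous unigram contributes to "tablets", so "phones" is ranked first.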
  • 13. Categorization Algorithm
  • 14. Model Pruning
      • We may decrease the size of the lexicon L by preserving only the tokens whose importance score exceeds a threshold C.
      • We will demonstrate experimentally that this choice combines a significant reduction of the size of L with only small losses in the accuracy of the algorithm.
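The pruning rule of slide 14 amounts to a single filter over the lexicon. The sketch below assumes a lexicon stored as a dictionary whose entries carry the importance score under the hypothetical key "I_t"; the function name `prune_lexicon` is illustrative as well.

```python
def prune_lexicon(lexicon, threshold):
    """Keep only the tokens whose importance score exceeds the threshold."""
    return {token: entry
            for token, entry in lexicon.items()
            if entry["I_t"] > threshold}


toy_lexicon = {
    "apple": {"I_t": 0.2, "rdv": {"phones": 2, "tablets": 1}},
    "apple iphone": {"I_t": 0.9, "rdv": {"phones": 2}},
    "pro": {"I_t": 0.7, "rdv": {"tablets": 1}},
}
pruned = prune_lexicon(toy_lexicon, threshold=0.7)
```

Because the comparison is strict, a token whose score equals the threshold is dropped, following the slide's wording that the score must exceed the threshold.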
  • 15. Experimental Setup
      • Dataset: 313,706 products and 230 categories (191 leaf) from shopmania.com.
      • Training/test set sizes: 60%/40%.
      • Four scenarios for the extracted n-grams:
          – N=1: extract unigrams (single words) only.
          – N=2: extract unigrams and bigrams.
          – N=3: extract unigrams, bigrams, and trigrams.
          – N=4: extract 1-, 2-, 3-, and 4-grams.
      • 0.0 < T < 1.0 with steps of 0.1.
  • 16. Examined Methods
      • SPC (Supervised Products Classifier): the proposed algorithm.
      • LogReg: Logistic Regression.
      • RanFor: Random Forests.
  • 17. Accuracy Evaluation
      • Highest accuracy: 95.1% for N=3, T=0.
      • Interesting measurement: accuracy of 93% for N=3, T=0.7.
  • 18. Pruned Model Evaluation
      • Recall: for N=3, T=0.7, accuracy is 93% with 47% fewer tokens and 54% fewer RDV entries.
  • 19. Conclusions
      • We presented a supervised learning algorithm for products categorization.
      • It trains a classification model based on:
          – the morphological analysis of the titles,
          – the extraction of n-grams of variable sizes,
          – the assignment of importance scores to each token.
      • The method achieves ~95% classification accuracy.
      • It also embodies a self-pruning strategy.
      • The experiments have demonstrated that this strategy reduces the size of the model by about 50%, with small losses in classifier performance.
  • 20. Thank You. Any Questions?