SlideShare a Scribd company logo
1 of 20
Download to read offline
Effective Products Categorization with Importance
Scores and Morphological Analysis of the Titles
Leonidas Akritidis, Athanasios Fevgas, Panayiotis Bozanis
Department of Electrical and Computer Engineering
Data Structuring and Engineering Lab
University of Thessaly
The 30th IEEE International Conference on Tools
with Artificial Intelligence (ICTAI 2018)
November 5-7, 2018, Volos, Greece
E-commerce
• Large growth rate:
– E-commerce share of retail sales worldwide (2015): 7.4%.
– E-commerce share of retail sales worldwide (2018): 11.9%.
– Predicted E-commerce share of retail sales worldwide
(2021): 17.5%1
• The related research problems have been rendered
increasingly important.
• Effective and efficient management, processing, and
mining of products data are examples of such
problems.
1https://www.statista.com/statistics/534123/e-commerce-share-of-retailsales-worldwide/
L. Akritidis, A. Fevgas, P. Bozanis 2IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Products Categorization
• One of the most important problems in the area.
• Given a product and a set of categories, it is
required that we determine the category where
the product belongs to.
• It leads to numerous novel applications:
– Query expansion and rewriting.
– Relevant/Similar products retrieval.
– Personalized recommendations etc.
L. Akritidis, A. Fevgas, P. Bozanis 3IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Attributes or not?
• Relevant work is divided into two categories:
– those which are based on the products titles only and
– those which take into consideration additional
properties of the product (brand name, attributes,
technical characteristics, etc).
• However, such metadata is not always present;
even if it is present, it is frequently incomplete,
ambiguous, inconsistent, or incorrect.
• The proposed method belongs to the first
category and operates by accessing the titles only.
L. Akritidis, A. Fevgas, P. Bozanis 4IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Theoretical Background
• Let Y be the set of all categories.
• The categories are usually organized into a tree
structure with parent and leaf categories.
• A product can only be assigned one leaf category
in the aforementioned tree.
• Each product is described by its title τ.
• The title has the words (w1, w2, … ,wlx).
L. Akritidis, A. Fevgas, P. Bozanis 5IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
N-grams vs. words
• The proposed method performs morphological
analysis of the titles and extracts n-grams of
variable sizes.
• The reason of employing n-grams is that a n-gram
is less ambiguous than single words.
• For instance, a brand name (e.g. Apple) may be
correlated with multiple diverse categories
(mobile phones, tablets, computers, etc).
• However, the bigram Apple iPhone is much more
specific.
L. Akritidis, A. Fevgas, P. Bozanis 6IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Tokens and Ambiguity
• We collectively refer to all n-grams and words as
tokens.
• Each token has its own level of ambiguity.
– Ambiguous tokens are not tightly correlated with a
single category; they can be connected with multiple
categories.
• Or, each token is of different importance.
• According to the previous example, longer tokens
are less ambiguous, that is, more important.
L. Akritidis, A. Fevgas, P. Bozanis 7IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Importance Scores
• Furthermore, a token is more important if it has
been correlated with only a few categories.
• Based on these notifications, we introduce the
following importance score for a token t.
– |Y|: the total number of categories,
– ft: the frequency of t, i.e. the number of the categories
which have been correlated with t, and
– lt: the length (in words) of t.
L. Akritidis, A. Fevgas, P. Bozanis 8IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Categorization Model: Training Phase
• The training phase builds a lexicon L for the tokens.
Each entry in the lexicon stores:
• The token t,
• Its frequency ft,
• Its importance score It,
• A relevance description vector (RDV) which for each
token-category relationship, includes a pair in the
form (y, ft,y), where y is a category and ft,y is another
frequency value that reflects how many times t has
been correlated with the category y.
L. Akritidis, A. Fevgas, P. Bozanis 9IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Model Training Algorithm
L. Akritidis, A. Fevgas, P. Bozanis 10IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Categorization Model: Testing Phase
• The testing phase is based on the lexicon L of the
previous phase.
• Initially, an empty candidates list Y’is created.
• In the sequel, for each token t of each product p,
we perform a search in L.
• In case it is successful, we retrieve the RDV, ft and
It of t. Then, we traverse the RDV and for each
entry y we update the candidates list Y’.
L. Akritidis, A. Fevgas, P. Bozanis 11IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Candidates Scoring
• The score St,y of each candidate category is
expressed as linear combination of the
importance score of the token and a quantity Qt,y:
• Finally, the candidates are sorted in decreasing
St,y order and the top candidate is selected as a
category for the product.
L. Akritidis, A. Fevgas, P. Bozanis 12IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Categorization Algorithm
L. Akritidis, A. Fevgas, P. Bozanis 13IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Model Pruning
• We may decrease the size of the lexicon L by
preserving only the tokens whose importance
score exceeds a threshold C:
• We will demonstrate experimentally that this
choice combines a significant reduction of the
size of L, with infinitesimal losses in the accuracy
of the algorithm.
L. Akritidis, A. Fevgas, P. Bozanis 14IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Experimental Setup
• Dataset: 313,706 products & 230 (191 leaf)
categories from shopmania.com.
• Training/Test Set sizes: 60%/40%.
• Four scenarios for the extracted n-grams:
– N=1: Extract unigrams (single words) only.
– N=2: Extract unigrams & bigrams.
– N=3: Extract unigrams, bigrams, & trigrams.
– N=4: Extract 1-, 2-, 3-, and 4-grams.
• 0.0 < T < 1.0 with steps of 0.1.
L. Akritidis, A. Fevgas, P. Bozanis 15IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Examined Methods
• SPC – Supervised Products Classifier – the
proposed algorithm.
• LogReg – Logistic Regression.
• RanFor – Random Forests.
L. Akritidis, A. Fevgas, P. Bozanis 16IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Accuracy Evaluation
L. Akritidis, A. Fevgas, P. Bozanis 17IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Highest Accuracy:
95.1% for N=3, T=0.
Interesting
Measurement:
Accuracy 93% for
N=3, T=0.7.
Pruned Model Evaluation
L. Akritidis, A. Fevgas, P. Bozanis 18IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Recall: For N=3, T=0.7,
accuracy 93% with 47%
fewer tokens and 54%
fewer RDV entries
Conclusions
• We presented a supervised learning algorithm for
products categorization.
• It trains a classification model based on:
– The morphological analysis of the titles,
– The extraction of n-grams of variable sizes,
– The assignment of importance scores to each token.
• The method achieves ~95% classification accuracy.
• It also embodies a self-pruning strategy.
• The experiments have demonstrated that this
strategy leads to a reduction of about 50% in the size
of the model combined with small losses in the
classifier performance.
L. Akritidis, A. Fevgas, P. Tosmpanopoulou P. Bozanis 19IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
Thank You
Any Questions?
L. Akritidis, A. Fevgas, P. Tosmpanopoulou P. Bozanis 20IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece

More Related Content

Similar to Effective Products Categorization with Importance Scores and Morphological Analysis of the Titles

Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...Aliaksandr Birukou
 
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019webwinkelvakdag
 
Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...Anastasija Nikiforova
 
Interannotator Agreement
Interannotator AgreementInterannotator Agreement
Interannotator Agreementjohn6938
 
Preservation Planning using Plato, by Hannes Kulovits and Andreas Rauber
Preservation Planning using Plato, by Hannes Kulovits and Andreas RauberPreservation Planning using Plato, by Hannes Kulovits and Andreas Rauber
Preservation Planning using Plato, by Hannes Kulovits and Andreas RauberJISC KeepIt project
 
Evalita2018 iListen - itaLIan Speech acT labEliNg
Evalita2018 iListen - itaLIan Speech acT labEliNgEvalita2018 iListen - itaLIan Speech acT labEliNg
Evalita2018 iListen - itaLIan Speech acT labEliNgNicole Novielli
 
Metadata catalogues survey results, EOSCpilot H2020 EU project
Metadata catalogues survey results, EOSCpilot H2020 EU projectMetadata catalogues survey results, EOSCpilot H2020 EU project
Metadata catalogues survey results, EOSCpilot H2020 EU projectMassimiliano Assante
 
An introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalAn introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalMounia Lalmas-Roelleke
 
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Marina Santini
 
Evidence-based Semantic Web Just a Dream or the Way to Go?
Evidence-based Semantic WebJust a Dream or the Way to Go?Evidence-based Semantic WebJust a Dream or the Way to Go?
Evidence-based Semantic Web Just a Dream or the Way to Go?Dragan Gasevic
 
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...Dr. Haxel Consult
 
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data AnalysisWorkshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data AnalysisOlga Scrivner
 
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerceOff-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerceLadislav Peska
 
Report on the First Knowledge Graph Reasoning Challenge 2018 -Toward the eXp...
Report on the First Knowledge Graph Reasoning Challenge  2018 -Toward the eXp...Report on the First Knowledge Graph Reasoning Challenge  2018 -Toward the eXp...
Report on the First Knowledge Graph Reasoning Challenge 2018 -Toward the eXp...KnowledgeGraph
 
Oeb08 Dec08 Tyamada
Oeb08 Dec08 TyamadaOeb08 Dec08 Tyamada
Oeb08 Dec08 Tyamadatsyamada
 
«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...
«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...
«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...eMadrid network
 

Similar to Effective Products Categorization with Importance Scores and Morphological Analysis of the Titles (20)

Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...
 
Mastering an ontology & vocabulary management technology in France ?
Mastering an ontology & vocabulary management technology in France ?Mastering an ontology & vocabulary management technology in France ?
Mastering an ontology & vocabulary management technology in France ?
 
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019
GENERALIZED LANGUAGE MODELS FOR CASE LAW RETRIEVAL - Big Data Expo 2019
 
Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...
 
Kenett On Information NYU-Poly 2013
Kenett On Information NYU-Poly 2013Kenett On Information NYU-Poly 2013
Kenett On Information NYU-Poly 2013
 
CASE Network Studies and Analyses 393 - Are Unit Export Values Correct Maasur...
CASE Network Studies and Analyses 393 - Are Unit Export Values Correct Maasur...CASE Network Studies and Analyses 393 - Are Unit Export Values Correct Maasur...
CASE Network Studies and Analyses 393 - Are Unit Export Values Correct Maasur...
 
Interannotator Agreement
Interannotator AgreementInterannotator Agreement
Interannotator Agreement
 
Preservation Planning using Plato, by Hannes Kulovits and Andreas Rauber
Preservation Planning using Plato, by Hannes Kulovits and Andreas RauberPreservation Planning using Plato, by Hannes Kulovits and Andreas Rauber
Preservation Planning using Plato, by Hannes Kulovits and Andreas Rauber
 
Evalita2018 iListen - itaLIan Speech acT labEliNg
Evalita2018 iListen - itaLIan Speech acT labEliNgEvalita2018 iListen - itaLIan Speech acT labEliNg
Evalita2018 iListen - itaLIan Speech acT labEliNg
 
Metadata catalogues survey results, EOSCpilot H2020 EU project
Metadata catalogues survey results, EOSCpilot H2020 EU projectMetadata catalogues survey results, EOSCpilot H2020 EU project
Metadata catalogues survey results, EOSCpilot H2020 EU project
 
An introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalAn introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information Retrieval
 
Příklad bibliometrické zprávy
Příklad bibliometrické zprávyPříklad bibliometrické zprávy
Příklad bibliometrické zprávy
 
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
 
Evidence-based Semantic Web Just a Dream or the Way to Go?
Evidence-based Semantic WebJust a Dream or the Way to Go?Evidence-based Semantic WebJust a Dream or the Way to Go?
Evidence-based Semantic Web Just a Dream or the Way to Go?
 
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
 
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data AnalysisWorkshop nwav 47 - LVS - Tool for Quantitative Data Analysis
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis
 
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerceOff-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
 
Report on the First Knowledge Graph Reasoning Challenge 2018 -Toward the eXp...
Report on the First Knowledge Graph Reasoning Challenge  2018 -Toward the eXp...Report on the First Knowledge Graph Reasoning Challenge  2018 -Toward the eXp...
Report on the First Knowledge Graph Reasoning Challenge 2018 -Toward the eXp...
 
Oeb08 Dec08 Tyamada
Oeb08 Dec08 TyamadaOeb08 Dec08 Tyamada
Oeb08 Dec08 Tyamada
 
«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...
«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...
«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...
 

More from Leonidas Akritidis

An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank AggregationAn Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank AggregationLeonidas Akritidis
 
Effective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product TitlesEffective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product TitlesLeonidas Akritidis
 
A Supervised Machine Learning Algorithm for Research Articles
A Supervised Machine Learning Algorithm for Research ArticlesA Supervised Machine Learning Algorithm for Research Articles
A Supervised Machine Learning Algorithm for Research ArticlesLeonidas Akritidis
 
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceComputing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceLeonidas Akritidis
 
Positional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted IndexesPositional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted IndexesLeonidas Akritidis
 
Identifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does MatterIdentifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does MatterLeonidas Akritidis
 

More from Leonidas Akritidis (6)

An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank AggregationAn Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
 
Effective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product TitlesEffective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product Titles
 
A Supervised Machine Learning Algorithm for Research Articles
A Supervised Machine Learning Algorithm for Research ArticlesA Supervised Machine Learning Algorithm for Research Articles
A Supervised Machine Learning Algorithm for Research Articles
 
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceComputing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
 
Positional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted IndexesPositional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted Indexes
 
Identifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does MatterIdentifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does Matter
 

Recently uploaded

Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Tools
Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and ToolsMaximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Tools
Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Toolssoginsider
 
Seizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksSeizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksIJECEIAES
 
Augmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptxAugmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptxMustafa Ahmed
 
Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdfKamal Acharya
 
analog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxanalog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxKarpagam Institute of Teechnology
 
History of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & ModernizationHistory of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & ModernizationEmaan Sharma
 
CLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference ModalCLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference ModalSwarnaSLcse
 
Basics of Relay for Engineering Students
Basics of Relay for Engineering StudentsBasics of Relay for Engineering Students
Basics of Relay for Engineering Studentskannan348865
 
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfInvolute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfJNTUA
 
Working Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdfWorking Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdfSkNahidulIslamShrabo
 
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...Amil baba
 
Software Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfSoftware Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfssuser5c9d4b1
 
Adsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) pptAdsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) pptjigup7320
 
21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docxrahulmanepalli02
 
What is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsWhat is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsVIEW
 
Independent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging StationIndependent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging Stationsiddharthteach18
 
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxSLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxCHAIRMAN M
 
Dynamo Scripts for Task IDs and Space Naming.pptx
Dynamo Scripts for Task IDs and Space Naming.pptxDynamo Scripts for Task IDs and Space Naming.pptx
Dynamo Scripts for Task IDs and Space Naming.pptxMustafa Ahmed
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1T.D. Shashikala
 
Diploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdfDiploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdfJNTUA
 

Recently uploaded (20)

Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Tools
Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and ToolsMaximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Tools
Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Tools
 
Seizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksSeizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networks
 
Augmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptxAugmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptx
 
Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdf
 
analog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxanalog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptx
 
History of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & ModernizationHistory of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & Modernization
 
CLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference ModalCLOUD COMPUTING SERVICES - Cloud Reference Modal
CLOUD COMPUTING SERVICES - Cloud Reference Modal
 
Basics of Relay for Engineering Students
Basics of Relay for Engineering StudentsBasics of Relay for Engineering Students
Basics of Relay for Engineering Students
 
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfInvolute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
 
Working Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdfWorking Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdf
 
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...
 
Software Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfSoftware Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdf
 
Adsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) pptAdsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) ppt
 
21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx
 
What is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsWhat is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, Functions
 
Independent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging StationIndependent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging Station
 
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxSLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
 
Dynamo Scripts for Task IDs and Space Naming.pptx
Dynamo Scripts for Task IDs and Space Naming.pptxDynamo Scripts for Task IDs and Space Naming.pptx
Dynamo Scripts for Task IDs and Space Naming.pptx
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1
 
Diploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdfDiploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdf
 

Effective Products Categorization with Importance Scores and Morphological Analysis of the Titles

  • 1. Effective Products Categorization with Importance Scores and Morphological Analysis of the Titles Leonidas Akritidis, Athanasios Fevgas, Panayiotis Bozanis Department of Electrical and Computer Engineering Data Structuring and Engineering Lab University of Thessaly The 30th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2018) November 5-7, 2018, Volos, Greece
  • 2. E-commerce • Large growth rate: – E-commerce share of retail sales worldwide (2015): 7.4%. – E-commerce share of retail sales worldwide (2018): 11.9%. – Predicted E-commerce share of retail sales worldwide (2021): 17.5%1 • The related research problems have been rendered increasingly important. • Effective and efficient management, processing, and mining of products data are examples of such problems. 1https://www.statista.com/statistics/534123/e-commerce-share-of-retailsales-worldwide/ L. Akritidis, A. Fevgas, P. Bozanis 2IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 3. Products Categorization • One of the most important problems in the area. • Given a product and a set of categories, it is required that we determine the category where the product belongs to. • It leads to numerous novel applications: – Query expansion and rewriting. – Relevant/Similar products retrieval. – Personalized recommendations etc. L. Akritidis, A. Fevgas, P. Bozanis 3IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 4. Attributes or not? • Relevant work is divided into two categories: – those which are based on the products titles only and – those which take into consideration additional properties of the product (brand name, attributes, technical characteristics, etc). • However, such metadata is not always present; even if it is present, it is frequently incomplete, ambiguous, inconsistent, or incorrect. • The proposed method belongs to the first category and operates by accessing the titles only. L. Akritidis, A. Fevgas, P. Bozanis 4IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 5. Theoretical Background • Let Y be the set of all categories. • The categories are usually organized into a tree structure with parent and leaf categories. • A product can only be assigned one leaf category in the aforementioned tree. • Each product is described by its title τ. • The title has the words (w1, w2, … ,wlx). L. Akritidis, A. Fevgas, P. Bozanis 5IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 6. N-grams vs. words • The proposed method performs morphological analysis of the titles and extracts n-grams of variable sizes. • The reason of employing n-grams is that a n-gram is less ambiguous than single words. • For instance, a brand name (e.g. Apple) may be correlated with multiple diverse categories (mobile phones, tablets, computers, etc). • However, the bigram Apple iPhone is much more specific. L. Akritidis, A. Fevgas, P. Bozanis 6IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 7. Tokens and Ambiguity • We collectively refer to all n-grams and words as tokens. • Each token has its own level of ambiguity. – Ambiguous tokens are not tightly correlated with a single category; they can be connected with multiple categories. • Or, each token is of different importance. • According to the previous example, longer tokens are less ambiguous, that is, more important. L. Akritidis, A. Fevgas, P. Bozanis 7IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 8. Importance Scores • Furthermore, a token is more important if it has been correlated with only a few categories. • Based on these notifications, we introduce the following importance score for a token t. – |Y|: the total number of categories, – ft: the frequency of t, i.e. the number of the categories which have been correlated with t, and – lt: the length (in words) of t. L. Akritidis, A. Fevgas, P. Bozanis 8IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 9. Categorization Model: Training Phase • The training phase builds a lexicon L for the tokens. Each entry in the lexicon stores: • The token t, • Its frequency ft, • Its importance score It, • A relevance description vector (RDV) which for each token-category relationship, includes a pair in the form (y, ft,y), where y is a category and ft,y is another frequency value that reflects how many times t has been correlated with the category y. L. Akritidis, A. Fevgas, P. Bozanis 9IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 10. Model Training Algorithm L. Akritidis, A. Fevgas, P. Bozanis 10IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 11. Categorization Model: Testing Phase • The testing phase is based on the lexicon L of the previous phase. • Initially, an empty candidates list Y’is created. • In the sequel, for each token t of each product p, we perform a search in L. • In case it is successful, we retrieve the RDV, ft and It of t. Then, we traverse the RDV and for each entry y we update the candidates list Y’. L. Akritidis, A. Fevgas, P. Bozanis 11IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 12. Candidates Scoring • The score St,y of each candidate category is expressed as linear combination of the importance score of the token and a quantity Qt,y: • Finally, the candidates are sorted in decreasing St,y order and the top candidate is selected as a category for the product. L. Akritidis, A. Fevgas, P. Bozanis 12IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 13. Categorization Algorithm L. Akritidis, A. Fevgas, P. Bozanis 13IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 14. Model Pruning • We may decrease the size of the lexicon L by preserving only the tokens whose importance score exceeds a threshold C: • We will demonstrate experimentally that this choice combines a significant reduction of the size of L, with infinitesimal losses in the accuracy of the algorithm. L. Akritidis, A. Fevgas, P. Bozanis 14IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 15. Experimental Setup • Dataset: 313,706 products & 230 (191 leaf) categories from shopmania.com. • Training/Test Set sizes: 60%/40%. • Four scenarios for the extracted n-grams: – N=1: Extract unigrams (single words) only. – N=2: Extract unigrams & bigrams. – N=3: Extract unigrams, bigrams, & trigrams. – N=4: Extract 1-, 2-, 3-, and 4-grams. • 0.0 < T < 1.0 with steps of 0.1. L. Akritidis, A. Fevgas, P. Bozanis 15IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 16. Examined Methods • SPC – Supervised Products Classifier – the proposed algorithm. • LogReg – Logistic Regression. • RanFor – Random Forests. L. Akritidis, A. Fevgas, P. Bozanis 16IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 17. Accuracy Evaluation L. Akritidis, A. Fevgas, P. Bozanis 17IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece Highest Accuracy: 95.1% for N=3, T=0. Interesting Measurement: Accuracy 93% for N=3, T=0.7.
  • 18. Pruned Model Evaluation L. Akritidis, A. Fevgas, P. Bozanis 18IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece Recall: For N=3, T=0.7, accuracy 93% with 47% fewer tokens and 54% fewer RDV entries
  • 19. Conclusions • We presented a supervised learning algorithm for products categorization. • It trains a classification model based on: – The morphological analysis of the titles, – The extraction of n-grams of variable sizes, – The assignment of importance scores to each token. • The method achieves ~95% classification accuracy. • It also embodies a self-pruning strategy. • The experiments have demonstrated that this strategy leads to a reduction of about 50% in the size of the model combined with small losses in the classifier performance. L. Akritidis, A. Fevgas, P. Tosmpanopoulou P. Bozanis 19IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece
  • 20. Thank You Any Questions? L. Akritidis, A. Fevgas, P. Tosmpanopoulou P. Bozanis 20IEEE ICTAI 2018, November 5-7, 2018, Volos, Greece