Classification of E-commerce Websites by
Product Categories
Case Study
Moiseev George
Higher School of Economics
Faculty of Computer Science
Higher School of Economics, Moscow, 2016
www.hse.ru
Outline
• Introduction
• Preprocessing
• Feature extraction
• Classification and evaluation
• Experimental results
Problem Statement
• Retrieve e-commerce websites (e-shops)
• Classify e-shops by the type of product sold
*Customer-to-customer websites are not counted as e-commerce shops
Applications
• Market research
• Statistics gathering
• Organizing a knowledge base
• Goods search
Dataset
The dataset was provided by datainsight.ru
It contains two training subsets labeled by experts:
1. 1312 e-commerce and 1077 non-e-commerce websites
2. 1448 e-shops labeled with 15 product categories
Preprocessing
Downloading a website:
• Start from the main page
• Download every internal hyperlink on a page that has not been downloaded before
• Check whether an identical page was already downloaded through another hyperlink
What information should be saved from the other webpages? Three options:
1. Nothing
2. Only metadata
3. Everything
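The download procedure above can be sketched as a breadth-first crawl. This is a minimal illustration, not the author's crawler: the function names, the `fetch` callback, the `max_pages` limit, and the use of a SHA-256 content hash to detect identical pages are all assumptions made for the example.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(main_url, fetch, max_pages=100):
    """Breadth-first crawl of internal links starting from the main page.

    `fetch` is a caller-supplied function: url -> HTML string, or None on failure.
    Returns a dict mapping each kept URL to its HTML.
    """
    host = urlparse(main_url).netloc
    seen_urls = {main_url}   # URLs already queued, so no link is downloaded twice
    seen_hashes = set()      # content hashes, to catch equal pages behind different links
    pages = {}
    queue = deque([main_url])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # identical page already downloaded through another hyperlink
        seen_hashes.add(digest)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            link = urldefrag(urljoin(url, href)).url  # absolute URL without #fragment
            if urlparse(link).netloc == host and link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return pages


# Usage with a hypothetical in-memory site instead of real HTTP requests.
home = '<a href="/catalog">Catalog</a><a href="/index.html">Home</a><a href="http://other.example/">Ext</a>'
site = {
    "http://shop.example/": home,
    "http://shop.example/index.html": home,  # same content behind another link
    "http://shop.example/catalog": '<a href="/">Back</a><a href="/catalog#top">Top</a>',
}
pages = crawl("http://shop.example/", site.get)
```

In this toy site the external link is ignored, and `/index.html` is dropped because its content equals the main page.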
Preprocessing
Each webpage is stored in two versions:
• Raw page:
– Remove only JavaScript and obvious advertisements
• Cleaned page:
– Extract only the text content of markup tags
– Tokenization – splitting the text into sentences and words
– Stemming – reducing words to their root or base form
– Lowercase conversion
– Stopword filtering
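The cleaned-page steps can be sketched with the standard library alone. This is an illustration under stated assumptions: the stopword list is a tiny hypothetical sample, and `toy_stem` is a deliberately crude suffix stripper standing in for a real stemmer (the slides do not say which one was used). The sketch lowercases before tokenizing and filters stopwords before stemming so that the stopword list matches unstemmed forms.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "for", "with"}


class TextExtractor(HTMLParser):
    """Keeps the text content of markup tags, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_page(html):
    """Extract text, lowercase, tokenize into words, drop stopwords, stem."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks).lower()
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters, any alphabet
    return [toy_stem(token) for token in tokens if token not in STOPWORDS]


html = (
    "<html><head><title>Running Shoes</title>"
    "<script>var x = 'ads';</script></head>"
    "<body><p>The best shoes for running and hiking</p></body></html>"
)
tokens = clean_page(html)
```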
Feature Extraction
There are many methods and models for automatic text feature extraction:
• Bag of words
• n-grams
• word2vec
• TF-IDF (pictured)
• Mutual information
• Chi-square
• …
Feature extraction
Proposed approach:
The term weighting formula for the i-th term in the k-th website is derived from TF-IDF as follows:

$$W_{ik} = \frac{tf_{ik} \log\frac{N}{n_i}}{\sqrt{\sum_{j=1}^{N} \left( tf_{ij} \log\frac{N}{n_j} \right)^2}}$$

where $n_i$ is the number of websites in which the i-th term appears, $N$ is the total number of websites in the sample, and $tf_{ik}$ is computed as:

$$tf_{ik} = \sum_{t=1}^{T} w(t)\, f(i, k, t)$$

where $w(t)$ is a weight inversely proportional to the frequency of tag t, and $f(i, k, t)$ is the frequency of the i-th term in the t-th tag.
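The tag-weighted TF-IDF described above can be sketched as follows. Several choices here are assumptions rather than the author's implementation: w(t) is taken as the inverse of the tag's relative frequency across the sample, the normalization is the usual cosine (L2) normalization over the terms of one website, and the function names and input layout (each website as a list of `(tag, terms)` pairs) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tag_weights(sites):
    """w(t): assumed here to be total tag occurrences divided by occurrences of t."""
    tag_counts = Counter(tag for site in sites for tag, _ in site)
    total = sum(tag_counts.values())
    return {tag: total / count for tag, count in tag_counts.items()}


def weighted_tf(site, w):
    """tf_ik = sum over tags t of w(t) * f(i, k, t) for one website k."""
    tf = defaultdict(float)
    for tag, terms in site:
        for term in terms:
            tf[term] += w[tag]
    return dict(tf)


def tfidf_vectors(sites):
    """Return one L2-normalized {term: weight} dict per website."""
    w = tag_weights(sites)
    tfs = [weighted_tf(site, w) for site in sites]
    n_sites = len(sites)
    # n_i: number of websites in which term i appears
    df = Counter(term for tf in tfs for term in tf)
    vectors = []
    for tf in tfs:
        raw = {term: value * math.log(n_sites / df[term]) for term, value in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({term: (v / norm if norm else 0.0) for term, v in raw.items()})
    return vectors


# Two hypothetical websites, each a list of (tag, terms-in-that-tag) pairs.
sites = [
    [("title", ["shoes"]), ("p", ["shoes", "delivery"]), ("p", ["cart"])],
    [("title", ["news"]), ("p", ["delivery", "weather"])],
]
vectors = tfidf_vectors(sites)
```

Because "delivery" appears on both websites, its IDF factor is log(2/2) = 0 and it contributes nothing. "shoes" outweighs "cart" because it also occurs in the rarer, and therefore more heavily weighted, title tag.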
Classification and evaluation
• A Support Vector Machine is used as the classifier.
• Multiclass classification is performed in a one-vs-all manner.
• Precision, recall and F-score are used for evaluation.
• The overall performance of the product type classification is evaluated by the average F-score across all categories.
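The evaluation measures listed above can be sketched in a few lines. In this illustration each category is scored one-vs-all (that category against every other), and the per-category F-scores are then averaged with equal weight. The function names and the category labels in the usage example are hypothetical, and the equal-weight (macro) average is an assumption about what "average F-score" means on the slide.

```python
def precision_recall_f(y_true, y_pred, label):
    """Precision, recall and F-score for one category treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def average_f_score(y_true, y_pred):
    """Unweighted mean of the per-category F-scores over the true categories."""
    labels = sorted(set(y_true))
    return sum(precision_recall_f(y_true, y_pred, label)[2] for label in labels) / len(labels)


# Hypothetical expert labels and classifier predictions for five e-shops.
y_true = ["books", "books", "toys", "toys", "food"]
y_pred = ["books", "toys", "toys", "toys", "food"]
score = average_f_score(y_true, y_pred)
```

Here "books" scores 2/3 (one of two shops found), "toys" scores 0.8 (one false positive), and "food" scores 1.0, so the average is about 0.82.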
Results
F-score of the e-commerce class in binary classification:

Used website information                    | Pure TF-IDF | TF-IDF with tag weighting
Only main page                              | 0.85        | 0.89
Main page + meta and title from other pages | 0.89        | 0.94
Main page + whole other pages               | 0.86        | 0.92
Results
Average F-score of e-commerce categorization by sold product type:

Used website information                    | Pure TF-IDF | TF-IDF with tag weighting
Only main page                              | 0.67        | 0.72
Main page + meta and title from other pages | 0.74        | 0.79
Main page + whole other pages               | 0.73        | 0.81
Moiseev George
gvmoiseev@edu.hse.ru

 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 

George Moiseev - Classification of E-commerce Websites by Product Categories

  • 1. Classification of E-commerce Websites by Product Categories. Case Study. Moiseev George, Higher School of Economics, Faculty of Computer Science. Moscow, 2016. www.hse.ru
  • 2. Outline: Introduction; Preprocessing; Feature extraction; Classification and evaluation; Experimental results
  • 3. Problem Statement: retrieve e-commerce websites (e-shops); classify e-shops by the type of product sold. Customer-to-customer websites are not counted as e-commerce shops.
  • 4. Applications: market research; statistics gathering; organizing a knowledge base; goods search
  • 5. Dataset. The dataset was obtained from datainsight.ru. It contains two training subsets labeled by experts: (1) 1312 e-commerce and 1077 non-e-commerce websites; (2) 1448 websites in 15 product categories.
  • 6. Preprocessing. Downloading a website: start from the main page; download every internal hyperlink on a page that has not been downloaded before; check whether an identical page was already downloaded through another hyperlink. Options for what to save from the other pages: (1) nothing; (2) only metadata; (3) everything.
  • 7. Preprocessing. Each webpage is stored in two versions. Raw page: only JavaScript and obvious advertisements are removed. Cleaned page: only the content of markup tags is extracted, followed by tokenization (retrieving sentences and words), stemming (reducing words to their root or base form), lowercase conversion, and stopword filtering.
  • 8. Feature Extraction. There are many methods and models for automatic text feature extraction: bag of words; n-grams; word2vec; TF-IDF (shown in the slide's picture); mutual information; chi-square; …
  • 9. Feature extraction. Proposed approach: the term weighting formula for the i-th term in the k-th website is derived from TF-IDF as follows: $W_{ik} = \frac{tf_{ik} \log\frac{N}{n_i}}{\sum_{j=1}^{N} \left(tf_{ij} \log\frac{N}{n_j}\right)^2}$, where $n_i$ is the number of websites in which the i-th term appears, $N$ is the total number of websites in the sample, and $tf_{ik}$ is computed as $tf_{ik} = \sum_{t}^{T} w(t)\, f(i, k, t)$, where $w(t)$ is inversely proportional to the frequency of tag $t$ and $f(i, k, t)$ is the frequency of the i-th term in the t-th tag.
  • 10. Classification and evaluation. A Support Vector Machine is used as the classifier. Multiclass classification is performed in a one-vs-all manner. Precision, recall and F-score are used for evaluation. The overall performance of product type classification is measured by the average F-score over all categories.
  • 11. Results: F-score of the e-commerce class in binary classification.
      Used website information                     | Pure TF-IDF | TF-IDF with tag weighting
      only main page                               | 0.85        | 0.89
      main page + meta and title from other pages  | 0.89        | 0.94
      main page + whole other pages                | 0.86        | 0.92
  • 12. Results: average F-score of e-commerce categorization by sold product type.
      Used website information                     | Pure TF-IDF | TF-IDF with tag weighting
      only main page                               | 0.67        | 0.72
      main page + meta and title from other pages  | 0.74        | 0.79
      main page + whole other pages                | 0.73        | 0.81
  • 13. References
      1. A. Rahmani and S. Meshkizadeh, "Webpage Classification based on Compound of Using HTML Features & URL Features and Features of Sibling Pages", International Journal of Advancements in Computing Technology, vol. 2, no. 4, pp. 36-46, 2010.
      2. A. Aizawa, "An information-theoretic perspective of tf-idf measures", Information Processing & Management, vol. 39, no. 1, pp. 45-65, 2003.
      3. D. Powers, "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation", Journal of Machine Learning Technologies, vol. 1, no. 2, pp. 37-63, 2011.
      4. C. Cortes and V. Vapnik, "Support-vector networks", Machine Learning, 1995.
      5. R. Ghani, S. Slattery and Y. Yang, "Hypertext categorization using hyperlink patterns and meta data", ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 178-185, 2001.
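
The download procedure on slide 6 (start at the main page, follow internal links not yet visited, skip pages whose content was already downloaded under another link) could be sketched roughly as below. This is only an illustration, not the author's crawler: the `crawl` function, the `fetch` callback, the `max_pages` limit, and the use of a SHA-256 content hash to detect identical pages are all assumptions introduced here.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of internal links.

    fetch(url) is a caller-supplied function (hypothetical) that returns
    the page HTML as a string, or None if the page cannot be downloaded.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen_urls = {start_url}
    seen_hashes = set()
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # identical content already stored under another link
        seen_hashes.add(digest)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href).split("#")[0]
            # Keep only links on the same host that were not queued before.
            if urlparse(target).netloc == host and target not in seen_urls:
                seen_urls.add(target)
                queue.append(target)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library and makes it easy to try on an in-memory dictionary of pages.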
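
The "cleaned page" steps from slide 7 might look like the following minimal sketch. Several parts are assumptions made for illustration: the tiny `STOPWORDS` set, the toy `naive_stem` function (a real stemmer would replace it), and the order of the steps, which here is lowercase, tokenize, filter stopwords, then stem. Sentence retrieval is not covered.

```python
import re
from html.parser import HTMLParser

# Illustrative stopword list only; a real list would be far longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is"}


class TextExtractor(HTMLParser):
    """Keeps text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def naive_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_page(html):
    """Extract tag content, lowercase, tokenize, drop stopwords, stem."""
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(extractor.chunks).lower()
    tokens = re.findall(r"\w+", text)
    return [naive_stem(token) for token in tokens if token not in STOPWORDS]
```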
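
The tag-weighted term frequency from slide 9 can be illustrated with a short sketch. It computes tf_ik as the sum over tags of w(t) · f(i, k, t) and then the numerator tf_ik · log(N / n_i); the normalizing denominator is left out. The values in `TAG_WEIGHTS` are placeholders (the slides only state that w(t) is inversely proportional to the tag's frequency), f(i, k, t) is taken to be a raw count, and the function names and the tag-to-terms input layout are assumptions.

```python
import math
from collections import Counter

# Hypothetical tag weights w(t); placeholder values, not from the slides.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "p": 1.0}


def tag_weighted_tf(site):
    """tf_ik = sum over tags t of w(t) * f(i, k, t) for one website.

    `site` maps a tag name to the list of terms found inside that tag.
    """
    tf = Counter()
    for tag, terms in site.items():
        weight = TAG_WEIGHTS.get(tag, 1.0)
        for term, count in Counter(terms).items():
            tf[term] += weight * count
    return tf


def unnormalized_weights(sites):
    """Return tf_ik * log(N / n_i) per term for every website."""
    n_sites = len(sites)
    tfs = [tag_weighted_tf(site) for site in sites]
    # n_i: number of websites in which term i appears at least once.
    doc_freq = Counter(term for tf in tfs for term in tf)
    return [
        {term: value * math.log(n_sites / doc_freq[term]) for term, value in tf.items()}
        for tf in tfs
    ]
```

With these placeholder weights, a term in a `title` tag counts three times as much as the same term in a paragraph, and a term that appears on every website gets weight zero because log(N / n_i) = 0.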
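
The evaluation described on slide 10 (precision, recall and F-score per category, then the average F-score over all categories) could be computed as in this sketch. It assumes the balanced F-score (F1) and an unweighted mean over categories, neither of which the slides state explicitly; the function names are made up for illustration, and the classifier itself is not included.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f_score)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_score = 2 * precision * recall / (precision + recall)
        else:
            f_score = 0.0
        scores[label] = (precision, recall, f_score)
    return scores


def average_f_score(y_true, y_pred):
    """Unweighted mean of the per-category F-scores."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f for _, _, f in scores.values()) / len(scores)
```

Each category is scored in the same one-vs-all fashion as the classifier: predictions for that category count as positives and everything else as negatives.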