Amadeus Magrabi
@amadeusmagrabi
BOOSTING PRODUCT CATEGORIZATION WITH MACHINE LEARNING
…
11/2017 2
Company:
Customers: People who want to sell something online
…
Main product: REST API to manage online shops
11/2017 3
User Interface
Company:
11/2017 4
Machine Learning for Category Recommendations
Goal: Use machine learning to automatically recommend categories for products
[Figure: example category tree with the nodes Fashion, Men, Women, Sports, Business, Shoes and Pants]
11/2017 8
Challenges
Challenge: Every online store has a different category structure.
[Figure: two example category trees. Store 1: Fashion, Men, Women, Jeans. Store 2: Clothing, Pants, Shirts, Shoes.]
Option 1: multiple store-specific models
[Figure: Model 1, Model 2 and Model 3 each predict directly for Store 1, Store 2 and Store 3.]
• Better accuracies for stores with very specific categories
• No category matching necessary
Option 2: one general model
[Figure: Model 1 predicts General Categories, which are then matched to Store 1, Store 2 and Store 3.]
• More data-per-model
• More flexible
• Easier to deploy
• Also works for stores with little data
• Can also recommend categories that are not yet defined in the store
11/2017 9
Challenges
Challenge: Product data is diverse and unbalanced, which complicates feature selection.
• Product names
• Images
• Prices
• Descriptions
• Sizes
• Brands
• Colors
• Expiration Dates
• …
Approach:
→ Focus on names, images and descriptions as features
• carry most information
• available for most products
11/2017 10
Challenges
Challenge: Very large class set
• Amazon and eBay list 50,000+ categories
• Tradeoff: Coverage vs. Accuracy
Approach:
→ select broad model categories to cover main use cases
→ rely on category matching procedure to catch more specialized categories
→ current version has a selection of 723 model categories
11/2017 11
Overview of Approach
[Figure: a name model, a description model and an image model each predict over 723 general categories, which are then matched to the categories of Store 1, Store 2 and Store 3.]
11/2017 13
Model for Product Images
• Model: Convolutional Neural Network (Deep Learning)
• Similar to mechanisms in the brain: Idea of building complex representations by combining simple representations
• Trained via transfer learning from the well-known image recognition network Inception v3 (TensorFlow, Google Cloud ML Engine)
11/2017 15
Model for Product Names
Examples:
“Mens Heavyweight 6.1-ounce, 100% cotton T-Shirts in Regular, Big and Tall Sizes”
“Gala Apples Fresh Fruit, 3 LB Bag”
“Carhartt Men's Maddock Pocket T-Shirt Size M”
“Samsung SM-G900V - Galaxy S5 - 16GB Android Smartphone Verizon + GSM - Black”
Preprocessing: (spacy, re, gensim, Google Translate, pyenchant)
• spellchecker (smartwathc → smartwatch)
• translation (German → English)
• tokenization (complete names → words)
• normalization (lowercasing, deleting special characters)
• lemmatization (apples → apple)
• phrasing (louis vuitton → louis_vuitton)
• word removal (stop words, blacklist)
Models: (scikit-learn)
• Logistic Regression
• Naive Bayes
• Random Forest
• XGBoost
• Support Vector Machine
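
A few of the preprocessing steps listed on this slide (normalization, tokenization, phrasing and word removal) can be sketched with the standard library alone. This is only an illustration under stated assumptions: the stop words, blacklist and phrase list below are hypothetical, and the talk derives phrases with gensim rather than from a fixed list. Spellchecking, translation and lemmatization are left out because they depend on the external tools named above.

```python
import re
from typing import List

# Hypothetical resources for illustration only; the talk does not list
# the actual stop words, blacklist or phrases that were used.
STOP_WORDS = {"in", "and", "the", "of", "for", "with"}
BLACKLIST = {"size", "sizes"}
PHRASES = {("louis", "vuitton"), ("t", "shirt")}


def normalize(name: str) -> str:
    """Lowercase, drop apostrophes, replace other special characters with spaces."""
    lowered = name.lower().replace("'", "")
    return re.sub(r"[^a-z0-9]+", " ", lowered).strip()


def tokenize(name: str) -> List[str]:
    """Split a normalized name into words."""
    return name.split()


def join_phrases(tokens: List[str]) -> List[str]:
    """Merge known two-word phrases into one token (louis vuitton -> louis_vuitton)."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def remove_words(tokens: List[str]) -> List[str]:
    """Drop stop words and blacklisted words."""
    return [t for t in tokens if t not in STOP_WORDS and t not in BLACKLIST]


def preprocess(name: str) -> List[str]:
    return remove_words(join_phrases(tokenize(normalize(name))))


print(preprocess("Carhartt Men's Maddock Pocket T-Shirt Size M"))
```

Keeping each step as its own small function makes it easy to reorder steps or swap one for a library-backed version, for example a real lemmatizer.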
11/2017 18
Model for Product Names
Vectorization methods: (text → numbers)
bag-of-words:
• Simple approach, but sparse representation and blind to context
tf-idf:
• Similar to bag-of-words, but weights words higher when they do not occur frequently in the dataset
• Intuition: “the” has less predictive value than “iPhone”
• TF(w) = (number of times word w appears in name) / (total number of words in name)
• IDF(w) = log_e(total number of names / number of names with word w in it)
word2vec:
• Trains a two-layer neural network that predicts the context words of a word
• Results in a dense and context-sensitive representation
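
The bag-of-words and tf-idf definitions on this slide can be worked through on a toy corpus. The sketch below assumes the usual convention that the tf-idf weight is the product TF(w) · IDF(w), which the slide does not state explicitly; the three product names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus of already tokenized product names (made up for illustration).
names = [
    ["apple", "iphone", "the", "smartphone"],
    ["the", "gala", "apple", "bag"],
    ["the", "samsung", "smartphone"],
]


def bag_of_words(tokens, vocabulary):
    """Count vector with one entry per vocabulary word (mostly zeros)."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


def tf(word, tokens):
    """Times the word appears in the name / total number of words in the name."""
    return tokens.count(word) / len(tokens)


def idf(word, corpus):
    """log_e(total number of names / number of names containing the word).

    Raises ZeroDivisionError if the word occurs in no name.
    """
    n_containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / n_containing)


def tf_idf(word, tokens, corpus):
    """Assumed combination: product of TF and IDF."""
    return tf(word, tokens) * idf(word, corpus)


vocabulary = sorted({w for tokens in names for w in tokens})
print(vocabulary)
print(bag_of_words(names[0], vocabulary))
print(tf_idf("the", names[0], names))
print(tf_idf("iphone", names[0], names))
```

Because "the" occurs in every name, its IDF is log_e(3/3) = 0 and its weight vanishes, while "iphone" occurs in only one name and keeps a positive weight. This is the intuition stated on the slide.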
11/2017 19
Model for Product Names
Model for Product Descriptions
11/2017 20
Category Matching
Model categories are matched to store-specific categories via a word2vec model trained on a news dataset
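
One way to picture this matching step is a nearest-neighbor lookup by cosine similarity between word vectors. The sketch below is an assumption about how such a match could work, not the talk's actual procedure: the three-dimensional vectors are invented stand-ins for real word2vec embeddings, and `match_category` is a hypothetical helper.

```python
import math

# Hypothetical 3-dimensional word vectors for illustration; a real word2vec
# model trained on news text would have hundreds of dimensions.
VECTORS = {
    "pants": [0.9, 0.1, 0.0],
    "jeans": [0.8, 0.2, 0.1],
    "shoes": [0.1, 0.9, 0.0],
    "sneakers": [0.2, 0.8, 0.1],
    "shirts": [0.5, 0.1, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_category(model_category, store_categories):
    """Return the store category whose vector is most similar to the model category."""
    return max(
        store_categories,
        key=lambda c: cosine(VECTORS[model_category], VECTORS[c]),
    )


store_categories = ["pants", "shirts", "shoes"]
print(match_category("jeans", store_categories))
print(match_category("sneakers", store_categories))
```

With these toy vectors the general category "jeans" lands on the store category "pants" and "sneakers" lands on "shoes", even though neither word appears in the store's own category list.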
[Figure: the name model, description model and image model each predict over the 723 general categories; their class probabilities are averaged, and the result is matched to the categories of Store 1, Store 2 and Store 3 via word2vec similarity.]
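
The averaging of class probabilities shown in the diagram can be sketched as follows. Equal weights for the three models are an assumption, and the probabilities are made up; each dictionary stands for one model's output over a small subset of the general categories.

```python
def average_probabilities(*model_outputs):
    """Average per-category probabilities across models (equal weights assumed)."""
    categories = model_outputs[0].keys()
    n = len(model_outputs)
    return {c: sum(out[c] for out in model_outputs) / n for c in categories}


# Made-up probabilities over three of the general categories.
name_model = {"shoes": 0.6, "pants": 0.3, "shirts": 0.1}
description_model = {"shoes": 0.5, "pants": 0.4, "shirts": 0.1}
image_model = {"shoes": 0.2, "pants": 0.7, "shirts": 0.1}

combined = average_probabilities(name_model, description_model, image_model)
best = max(combined, key=combined.get)
print(best, round(combined[best], 3))
```

Averaging lets a confident model outvote two lukewarm ones: here the image model's strong vote for "pants" tips the combined prediction even though the two text models slightly prefer "shoes".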
11/2017 21
REST API
General API
11/2017 22
REST API
Store-Specific API
11/2017 23
GUI Integration
11/2017 24
Thank you!
Amadeus Magrabi
@amadeusmagrabi
amadeus.magrabi@commercetools.com
[Figure: overview of the approach with the models labeled. Image model: TensorFlow, Inception. Name model: tf-idf, LogReg. Description model: tf-idf, LogReg. The models predict over the 723 general categories, their class probabilities are averaged, and the result is matched to Store 1, Store 2 and Store 3 via word2vec similarity.]

DN 2017 | Boosting Product Categorization with Machine Learning | Amadeus Magrabi | commercetools
