Embedding Based Frequently Bought
Together Recommendations:
A Production Use Case
Agenda
Mehmet Selman Sezgin
Senior Data Engineer at Hepsiburada
Ulukbek Attokurov
Data Scientist at Hepsiburada
Content
● Embedding Based Recommendations
● Modeling (Frequently Bought Together)
● Arithmetic operations on Embeddings
● Architecture Overview
● Serving Layer
● Experimental UI
● Online Metrics
● Conclusion
▪ 40+ categories
▪ 200M+ visitors per month
▪ 30M+ products
https://developers.google.com/machine-learning/crash-course/embeddings
An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors.
Co-occurrence vs Embeddings

Co-occurrence statistics
▪ Uses raw co-occurrence statistics (e.g. Salton89, a TF-IDF based metric)
▪ Uses behavior data (product views, orders, add-to-cart)
▪ Generates item-based recommendations
▪ Cannot project users and items into the same space

Embeddings
▪ Uses advanced methods (ResNet, Inception, VGG, Word2Vec, Doc2Vec, AutoEncoders, BERT etc.)
▪ Generates item-based and user-based recommendations
▪ Uses content information such as product image, product description, product name, product attributes
▪ Image, text, and behavior embeddings can be projected into the same space
Co-occurrence vs Embeddings

Co-occurrence statistics
▪ Products are not recommended if they do not appear in the same context
▪ Context information, such as products appearing in the same session or transaction, is not employed
▪ Content information (image, text etc.) is not used

Embeddings
▪ Similarity metrics can be calculated
▪ Can be used as features in unsupervised and supervised methods to optimize a business metric such as a propensity score
▪ Can be used as features in neural networks such as LSTMs to model customer behavior over time
▪ Can be used as features in KNN to recommend the most similar items
Frequently Bought Together
▪ Goal: building recommendations that offer complementary products to our customers
▪ Challenges:
▪ Orders might contain products from diverse categories
▪ Generating recommendations over 30M+ products distributed across 40+ categories
▪ Tip: "bought together" does not mean that the items which co-occur in the sequence are similar
▪ Our model choice: Word2Vec
Word2Vec
▪ Easy to use
▪ Easy to train
▪ Simple format of training samples
▪ User-friendly libraries like Gensim
▪ A few parameters to optimize
▪ A lot of practical use cases
Data Preparation

NLP
▪ Sentence → Bag-of-Words
▪ “I am attending a conference” → [“I”, “attending”, “conference”]

Frequently Bought Together
▪ User behavior (views, purchases etc.) → set of purchased items
▪ Order: Keyboard, Computer, Mouse → [“Keyboard”, “Computer”, “Mouse”]
Data Preparation - Context Separation
▪ Sequences may contain products from diverse categories
▪ Sequence: [“Keyboard”, “Mouse”, “Shoes”, “Socks”]
▪ Sub-sequences may be created depending on labels such as category, brand etc.
▪ Sub-sequences: [“Keyboard”, “Mouse”] and [“Shoes”, “Socks”]
Code Sample for Data Preparation
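A minimal sketch of the preparation step described above, assuming each order arrives as (category, product_id) pairs; the function and variable names are illustrative, not the production code:

```python
from collections import defaultdict

def to_subsequences(order_items):
    """Split one order into per-category sub-sequences.

    order_items: list of (category, product_id) tuples.
    """
    groups = defaultdict(list)
    for category, product_id in order_items:
        groups[category].append(product_id)
    # keep only sub-sequences with at least two items; a single item
    # carries no co-occurrence signal for Word2Vec
    return [items for items in groups.values() if len(items) > 1]

order = [("electronics", "keyboard"), ("electronics", "mouse"),
         ("apparel", "shoes"), ("apparel", "socks")]
print(to_subsequences(order))
# [['keyboard', 'mouse'], ['shoes', 'socks']]
```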
Word2Vec Parameters
▪ Random Search is applied to restrict the parameter search space
▪ Grid Search is then applied to select the optimal parameters
▪ The following Word2Vec parameters are optimized (a training sketch follows below):
▪ min_count: preferably set low, otherwise coverage decreases
▪ sample: the most frequent items dominate sequences, which can yield noisy embeddings and is computationally inefficient
▪ window: the context length is set to the maximum sequence length, since the order of items within a sequence is random
▪ size: a trade-off between network size, storage and computational cost; set as small as possible without losing recommendation quality
▪ iter: the default value is very low, so it is set between 50 and 80; the model is not trained well when iter is set to low values
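As an illustration, a hedged Gensim training call wiring these parameters together (gensim 3.x names as on the slide; gensim 4+ renames size to vector_size and iter to epochs). The concrete values are placeholders, not the tuned production settings:

```python
from gensim.models import Word2Vec

# sequences: per-category product-id sub-sequences from data preparation
sequences = [["keyboard", "mouse", "usb_hub"], ["shoes", "socks"]]

model = Word2Vec(
    sentences=sequences,
    min_count=1,   # keep low so rare products stay in the vocabulary (coverage)
    sample=1e-4,   # downsample very frequent items to reduce their dominance
    window=10,     # >= max sequence length, since item order is random
    size=64,       # embedding dimension: as small as quality allows
    iter=60,       # well above the default of 5; the slide suggests 50-80
    sg=1,          # skip-gram; an assumption, the talk does not state the architecture
)

print(model.wv.most_similar("keyboard", topn=2))
```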
Similarity Functions
▪ The KNN algorithm is employed to find the most similar items
▪ Different similarity metrics are used: Euclidean distance and cosine similarity
▪ Euclidean distance measures the distance between two points and is affected by the length of the vectors. Vectors therefore need to be normalized in order to obtain more accurate results.
▪ In cosine similarity, the angle between two vectors determines their similarity.
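A small NumPy sketch of the two metrics; note that after L2 normalization, Euclidean distance and cosine similarity rank neighbours identically:

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction, twice the length

print(euclidean(a, b))           # > 0: the length difference matters
print(cosine_similarity(a, b))   # 1.0: the angle is zero

# normalizing removes the length effect from Euclidean distance
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(euclidean(a_n, b_n))       # ~0.0
```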
Offline Metrics
▪ We need simple statistical metrics to check the performance of the model and to tune parameters
▪ Precision@k
▪ (# of recommended items @k that are relevant) / (# of recommended items @k)
▪ Recall@k
▪ (# of recommended items @k that are relevant) / (total # of relevant items)
▪ HitRate@k
▪ (# of hits @k recommendations) / (total # of test users)
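A minimal sketch of these three formulas; the names and toy data are illustrative:

```python
def precision_at_k(recommended, relevant, k):
    top_k = recommended[:k]
    return len(set(top_k) & set(relevant)) / len(top_k)

def recall_at_k(recommended, relevant, k):
    top_k = recommended[:k]
    return len(set(top_k) & set(relevant)) / len(relevant)

def hit_rate_at_k(users, k):
    """users: list of (recommended, relevant) pairs, one per test user."""
    hits = sum(1 for rec, rel in users if set(rec[:k]) & set(rel))
    return hits / len(users)

print(precision_at_k(["a", "b", "c"], ["b", "d"], k=3))  # 1/3
print(recall_at_k(["a", "b", "c"], ["b", "d"], k=3))     # 1/2
```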
MLflow Tracks the Model
▪ It is easy to visually inspect the parameters
▪ Evaluation metrics can be investigated graphically
▪ It is easy to integrate into the source code
▪ It is effective for team collaboration through the central server
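A hedged sketch of what this tracking can look like with the MLflow Python API; the tracking URI, experiment name and metric values are placeholders:

```python
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # placeholder central server
mlflow.set_experiment("fbt-word2vec")                   # hypothetical experiment name

params = {"min_count": 1, "sample": 1e-4, "window": 10, "size": 64, "iter": 60}

with mlflow.start_run():
    mlflow.log_params(params)
    # ... train the Word2Vec model and evaluate it on a held-out set ...
    mlflow.log_metric("precision_at_10", 0.18)  # illustrative values
    mlflow.log_metric("recall_at_10", 0.32)
    mlflow.log_metric("hit_rate_at_10", 0.41)
```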
Word2Vec Hyperparameter Tuning
Arithmetic Operations on Embeddings
▪ Is it possible to create new business dimensions using simple arithmetic on existing product embeddings?
▪ Similarity( AVG(Adidas_Shoes) , AVG(Nike_Shoes)) ≃ 1 ?
▪ Similarity( AVG(Camping tents) , AVG(Outdoor chairs)) ≃ 1 ?
▪ 1_Adidas_Shoe - Adidas_Brand + Nike_Brand ≃ 1_Similar_Nike_Shoe ?
▪ Relevancy decreases when entities at higher levels of the hierarchy, such as categories (Sport, Baby, Women's Clothes etc.), are represented using low-level entities such as products.
Arithmetic Operations on Embeddings
▪ Brand similarity is relevant if a brand contains homogeneous products in terms of categories (upper body clothes, lower body clothes etc.).
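A sketch of the checks above, assuming model is a trained Gensim Word2Vec model (as in the earlier sketch); the product-id lists are hypothetical, and brand vectors are simple averages of their products' embeddings:

```python
import numpy as np

def avg_embedding(vectors):
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical product-id lists grouped by brand
adidas_shoes = [model.wv[p] for p in adidas_shoe_ids]
nike_shoes = [model.wv[p] for p in nike_shoe_ids]

adidas_brand = avg_embedding(adidas_shoes)
nike_brand = avg_embedding(nike_shoes)
print(cosine(adidas_brand, nike_brand))  # close to 1 if the brands are homogeneous

# analogy-style query: a specific Adidas shoe, "moved" to the Nike brand
query = model.wv[one_adidas_shoe_id] - adidas_brand + nike_brand
print(model.wv.similar_by_vector(query, topn=5))
```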
Architecture Overview
Implementation Tips
▪ PySpark
▪ Enables working with any Python modelling library through Spark-to-pandas DataFrame conversion (a sketch follows below)
▪ Pandas UDFs are very useful for parallelization
▪ Conversion from a Spark DataFrame to a pandas DataFrame is still costly in terms of memory, in spite of using Arrow
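For example, a grouped applyInPandas sketch (PySpark 3.x) that builds the per-order, per-category sequences in parallel; the column names and toy rows are illustrative:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# illustrative order lines: one row per purchased product
df = spark.createDataFrame(
    [("o1", "electronics", "keyboard"), ("o1", "electronics", "mouse"),
     ("o1", "apparel", "shoes"), ("o1", "apparel", "socks")],
    ["order_id", "category", "product_id"],
)

def to_sequence(pdf: pd.DataFrame) -> pd.DataFrame:
    # runs on a pandas DataFrame per (order_id, category) group
    return pd.DataFrame({"sequence": [list(pdf["product_id"])]})

sequences = (
    df.groupBy("order_id", "category")
      .applyInPandas(to_sequence, schema="sequence array<string>")
)
sequences.show(truncate=False)
```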
Implementation Tips
▪ Model Quality
▪ Offline metrics, the experimental UI and online metrics should all be used for quality analysis
▪ Process
▪ Notebooks are useful in the experimental stage, but it is preferable not to use them in production
▪ The transition from the experimental stage to production should have minimum cost
▪ Metric validation should be part of the flow, not a background analysis in the production phase
Model Serving Layer
▪ Approximate Nearest Neighbour
Search Algorithms
▪ Annoy, Faiss, Hnswlib, ScaNN and many others
▪ Choose the library considering
▪ Open source benchmarks
▪ Programming language
▪ Similarity functions
▪ Distributed Index
▪ Incremental item insertion / deletion
▪ Ability to customize
▪ Our choice
▪ Hnswlib + Custom Post-Processing Layer
http://ann-benchmarks.com/
Model Serving Layer - HNSWLIB
▪ Trade-off between hierarchical navigable small world graph construction and search parameters (see the sketch after this list)
▪ Simple graph, weak search: less indexing time, less memory, less CPU usage, low recall
▪ Simple graph, strong search: less indexing time, less memory, more CPU usage, acceptable recall
▪ Complex graph, weak search: more indexing time, more memory, less CPU usage, high recall
▪ Complex graph, strong search: more indexing time, more memory, high CPU usage (waste), high recall
▪ Consider the following metrics to select optimal parameters
▪ Index size / memory consumption
▪ Build time
▪ CPU usage
▪ Queries per second
▪ Recall
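A minimal hnswlib sketch showing the knobs behind this trade-off: M and ef_construction shape the graph (build time, memory, recall ceiling), ef controls search effort (CPU per query vs recall). The values are placeholders to tune against the metrics above:

```python
import hnswlib
import numpy as np

dim, num_items = 64, 100_000
vectors = np.random.rand(num_items, dim).astype(np.float32)  # stand-in embeddings
ids = np.arange(num_items)

index = hnswlib.Index(space="cosine", dim=dim)
# M and ef_construction: graph complexity
index.init_index(max_elements=num_items, M=16, ef_construction=200)
index.add_items(vectors, ids)

# ef: search effort; must be >= k
index.set_ef(50)
labels, distances = index.knn_query(vectors[:1], k=10)
print(labels, distances)
```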
Model Serving Layer - Post Processing
▪ Similarity search alone will not be enough
▪ You will need to do some post-processing after retrieving results
▪ Implement your custom solution
▪ Do post-processing in the consuming service
▪ Use metadata and a solution that supports post-processing
▪ e.g. opendistro-for-elasticsearch, which supports an hnswlib index and brings post-processing functions
▪ Every solution has its own pros and cons. We implemented our own custom solution, which enriches the index with metadata and lets you inject any filtering or ranking methods that you need.
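A sketch of the metadata-enhanced post-processing idea; the metadata layout and filter rules here are illustrative, not our production code:

```python
def post_process(candidate_ids, metadata, source_item, k=10):
    """Filter raw ANN results using item metadata.

    candidate_ids: ids returned by the ANN index, nearest first.
    metadata: dict id -> {"category": ..., "brand": ..., "price": ...}
    """
    src = metadata[source_item]
    results = []
    for item_id in candidate_ids:
        meta = metadata[item_id]
        if meta["category"] == src["category"]:
            continue  # FBT should complement, not substitute (illustrative rule)
        if meta["price"] > 3 * src["price"]:
            continue  # illustrative price-aware filter
        results.append(item_id)
        if len(results) == k:
            break
    return results
```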
Post Filtering Validation Methods
Experimental UI
▪ Reveal what you need
▪ Variant level exclusions
▪ Category level restrictions and exclusions
▪ Brand level restrictions and exclusions
▪ Price aware filters
▪ Gender filters
▪ Top-N category diverse ranking
▪ Etc.
▪ Implement in serving layer
▪ Experiment again
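As one concrete example of the filters above, a sketch of top-N category diverse ranking, interpreted here as a round-robin over categories; names are illustrative:

```python
from collections import defaultdict

def diverse_top_n(ranked_ids, category_of, n=10):
    """Re-rank so the top-N covers as many categories as possible.

    ranked_ids: candidate ids ordered by similarity.
    category_of: dict id -> category.
    """
    by_category = defaultdict(list)
    for item_id in ranked_ids:
        by_category[category_of[item_id]].append(item_id)

    results = []
    # round-robin: take the best remaining item from each category in turn
    while len(results) < n and any(by_category.values()):
        for items in list(by_category.values()):
            if items and len(results) < n:
                results.append(items.pop(0))
    return results
```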
Model Serving Layer - Performance
▪ Single instance
▪ 8K requests per second
▪ Under 1 ms (~400 µs)
▪ Using assembly code instead of the default distance function implementations may improve indexing and query performance considerably (vectorization)
Model Serving Layer - Results on Production
Two FBT examples in production (shown after the add-to-cart action)
Online Metrics

Key Metrics
▪ CTR
▪ CR
▪ Coverage
▪ Diversity
▪ Revenue
▪ Usage Ratio
▪ Order Ratio

Dimensions
▪ Placement Title
▪ Placement Location
▪ Position in Placement
▪ Category Levels
▪ Channel
▪ Time of Week/Day
▪ Gender
Online Metrics
▪ Calculate your overall impact
▪ Do detailed analyses to increase domain knowledge, which leads to improvements in your recommendations
▪ If you rely only on CTR and CR, you may lose the big picture
▪ Popular products and their relatively higher CTRs may trap you in a vicious circle within a narrow space
▪ You should interpret the CR metric differently for different categories
Takeaways
▪ Use embedding representations in the recommendation domain as much as possible
▪ Word2Vec is easy to use and train (without GPUs), but tune its parameters wisely and assess offline metrics taking your business requirements into account
▪ Be careful when applying arithmetic operations on embeddings
▪ Follow small cycles during the experimental and production stages
▪ Design the serving layer considering your scale
▪ Use an experimental UI and apply post-filtering for more relevant results
▪ Track online metrics to understand the real impact of your solution