Chapters
1. Introduction
2. Service Architecture
3. Data
4. Training
5. Inference
6. Evaluation
#1 What is Skroutz.gr?
• Skroutz.gr is a marketplace & shopping assistant which
makes online shopping easier and more reliable
• It includes more than 11,000,000 products from 3,200
different e-shops
• The website welcomes more than 8 million unique visitors per
month, ranking among the top sites on the Greek Web
#1 Some Numbers
• 3,200 merchants
• 11 million products
• 270 mil. pageviews/mo
• 1.1 mil. searches/day
• 33 mil. sessions/mo
#1 The Problem
• Each day we collect thousands of new products by
downloading e-shop feeds (XML, CSV etc. - product
catalogs)
• We want to assign incoming product payloads, as provided by
e-shops, to the most relevant category in the Skroutz category
tree (taxonomy) with minimal human intervention. This is:
- Difficult
- Important
#1 Why Difficult?
• Many leaf categories in
Skroutz taxonomy (>2k)
• Sibling categories
(subjective categorization)
• Misleading product titles
and shop-categories from
shops
#1 Why Important?
1. Robot MO collects products from shop feeds and stores them to DB
2. Megatron category classifier categorizes products to the correct category
3. Tron groups similar products to entities called SKUs to be ready for indexing
4. Elasticsearch indexes products to be searchable from user interface
#1 Facts
•Merchants send roughly 15k new products to Skroutz every day!
•2.3k unique leaf categories in our category tree (taxonomy)
•Manual “move-to-category” action:
- Costs ~7.8s on average for content managers
- Subjective decisions may add extra overhead
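As a rough back-of-the-envelope check, the two figures above can be multiplied out. This sketch assumes, purely for illustration, that every new product would need one manual move; the slides do not state what share actually does.

```python
# Figures from the slide; the "every product needs a manual move"
# assumption is illustrative only.
NEW_PRODUCTS_PER_DAY = 15_000
SECONDS_PER_MOVE = 7.8

total_seconds = NEW_PRODUCTS_PER_DAY * SECONDS_PER_MOVE
total_hours = total_seconds / 3600

print(f"{total_seconds:.0f} s/day = {total_hours:.1f} hours/day")
```

Under that assumption the manual workload would be on the order of several full working days of effort per calendar day.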
#1 Old Solution - Overview
•Use Elasticsearch to match specific product attributes:
- PN (manufacturer part number)
- Name
- Shop-category
•Aggregate matches and group by categories
•Normalize results and use custom weights to calculate a score
•Take Top-K results
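The aggregate / normalize / weight / Top-K steps above might look roughly like the sketch below. The field weights, the per-field max normalization and the input tuple layout are all assumptions; the slides do not give the real scoring formula.

```python
from collections import defaultdict

# Hypothetical per-attribute weights; the real values are not in the slides.
FIELD_WEIGHTS = {"pn": 3.0, "name": 1.0, "shop_category": 0.5}


def top_k_categories(matches, k=3):
    """Rank categories from a list of (field, category_id, raw_score) matches."""
    # Normalize raw scores per field so that fields are comparable.
    max_per_field = defaultdict(float)
    for field, _, score in matches:
        max_per_field[field] = max(max_per_field[field], score)

    # Aggregate weighted, normalized scores per category.
    totals = defaultdict(float)
    for field, category_id, score in matches:
        if max_per_field[field] > 0:
            totals[category_id] += FIELD_WEIGHTS[field] * score / max_per_field[field]

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


matches = [
    ("pn", 12, 8.0),
    ("name", 12, 2.0),
    ("name", 40, 4.0),
    ("shop_category", 40, 1.0),
]
print(top_k_categories(matches, k=2))
```

The hand-picked weights are exactly the part that the later slides describe as hard to tune.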
#1 Old Solution - Limitations
•Plain cosine similarity on TF-IDF weights:
- No learning feedback loop
- No advanced statistics utilization (e.g. correlation between price
value and text features)
•No easy way to tune custom weights applied on final scoring
•Heuristics don’t take into account category specific context
•Heuristics don’t take into account word level context. E.g.
the word “samsung” is usually followed by the word “galaxy”,
which in turn is probably followed by a model number.
#1 Old Solution - Good Parts
•Simple solution (except for custom scoring stuff)
•Easy to debug
•Easy to deploy
•Online
#1 New Solution - “Megatron”
#1 Overview
•Approach problem as a supervised learning task
•Rely on probabilities to obtain a meaningful score
•Use more features from multiple sources, organized into datasets
•Learn new patterns and relations by training
•Measure performance on dataset splits
•Use a microservice to serve classification requests
•Apply threshold for low confidence results
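A minimal sketch of the last bullet, thresholding on the top class probability. The threshold value of 0.5 is a placeholder; the slides do not state the value used in production.

```python
def pick_category(probabilities, threshold=0.5):
    """Return (category_id, probability), or None when confidence is too low.

    probabilities: dict mapping category_id -> probability.
    threshold: hypothetical cut-off, not taken from the slides.
    """
    category_id, prob = max(probabilities.items(), key=lambda item: item[1])
    if prob < threshold:
        return None  # treated as "no prediction" and left for a human
    return category_id, prob


print(pick_category({101: 0.91, 202: 0.06, 303: 0.03}))
print(pick_category({101: 0.40, 202: 0.35, 303: 0.25}))
```

Raising the threshold trades a higher "no prediction" rate for a lower failure rate.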
#2 Service Architecture
1.Training Phase
2.Inference Phase
3.APIs
#2.1 Training Phase
1. Export dataset (product features labeled with category_id) and upload to Swift
2. Download specific dataset version in “training VM”
3. Start a training session using a train/val split from dataset
4. Save best performing model params snapshot (based on validation set loss)
5. Compress and upload model params to Swift container
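Step 4 amounts to keeping the parameters from the epoch with the lowest validation loss. A small sketch of that selection, with made-up loss values and string placeholders standing in for real model parameters:

```python
def best_snapshot(epoch_results):
    """Pick the (epoch, val_loss, params) tuple with the lowest val_loss."""
    best = None
    for epoch, val_loss, params in epoch_results:
        if best is None or val_loss < best[1]:
            best = (epoch, val_loss, params)
    return best


# Hypothetical training history; the numbers are illustrative only.
history = [
    (1, 0.92, "params-e1"),
    (2, 0.61, "params-e2"),
    (3, 0.58, "params-e3"),
    (4, 0.63, "params-e4"),  # validation loss went up again
]
epoch, val_loss, params = best_snapshot(history)
print(epoch, val_loss, params)
```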
#2.2 Inference Phase
1. Application Part: Send classification request
upon new product arrivals:
- Kafka producer (asynchronous request)
- Megatron Client synchronous HTTP requests (second alternative)
2. Category Classifier Microservice Part:
- Pop messages from stream (Kafka
consumer)
- Dispatch messages to in-memory Neural
Network instance
- Fetch predictions (scores) and post-back
to Core Application API endpoint
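The microservice part can be pictured as a simple consume / predict / post-back loop. In this sketch a `queue.Queue` stands in for the Kafka topic, and `fake_predict` and the message layout are hypothetical; it only illustrates the control flow, not a real Kafka client.

```python
import queue


def run_worker(stream, predict, post_back):
    """Drain `stream`, classify each message and post the result back."""
    handled = 0
    while True:
        try:
            message = stream.get_nowait()
        except queue.Empty:
            break  # a real consumer would keep polling instead
        scores = predict(message["features"])
        post_back({"product_id": message["product_id"], "scores": scores})
        handled += 1
    return handled


def fake_predict(features):
    # Stand-in for the in-memory neural network instance.
    return {"tvs": 0.9, "monitors": 0.1}


callbacks = []  # stand-in for the Core Application API endpoint
stream = queue.Queue()
stream.put({"product_id": 1, "features": {"name": "Samsung TV 32''"}})
stream.put({"product_id": 2, "features": {"name": "LG TV 43''"}})

handled = run_worker(stream, fake_predict, callbacks.append)
print(handled, callbacks)
```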
#2.3 APIs
1. Megatron microservice internal API
- Common API (wraps Keras API)
- Basic methods:
✓ build
✓ train
✓ save
✓ load
✓ predict
- CLI commands
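One way to read the "Common API" idea is an abstract base class exposing the five methods listed above. The class names below are hypothetical, and the toy subclass does not wrap Keras; it only shows the shape of such an interface.

```python
from abc import ABC, abstractmethod


class BaseClassifier(ABC):
    """Hypothetical common interface; method names follow the slide."""

    @abstractmethod
    def build(self):
        """Create the underlying model."""

    @abstractmethod
    def train(self, X, y):
        """Fit the model on features X and labels y."""

    @abstractmethod
    def save(self, path):
        """Persist model parameters."""

    @abstractmethod
    def load(self, path):
        """Restore model parameters."""

    @abstractmethod
    def predict(self, X):
        """Return one prediction per row of X."""


class MajorityClassifier(BaseClassifier):
    """Toy implementation: always predicts the most frequent training label."""

    def build(self):
        self.label = None
        return self

    def train(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def save(self, path):
        with open(path, "w") as fh:
            fh.write(str(self.label))

    def load(self, path):
        with open(path) as fh:
            self.label = fh.read()
        return self

    def predict(self, X):
        return [self.label for _ in X]


model = MajorityClassifier().build().train([[1], [2], [3]], ["tv", "phone", "tv"])
print(model.predict([[4], [5]]))
```

Keeping the interface this small means different model types can be swapped behind the same CLI and worker code.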
#2.3 APIs(2)
2. Skroutz Application Ecosystem (Ruby client)
- Megatron::Client
✓ Issues requests to microservice
- Megatron DB model
✓ Stores prediction results
- ApiController endpoint
✓ Receives callbacks from microservice
#3 Data
•Product attribute values (potential features)
Product:
- Name: Samsung TV 32'' DF324 (PNDFD22) Full HD Black NEW
- Shop manufacturer
- Part number: PNDFD22
- EAN
- Price: 300 €
- Shop category: Αρχική > Ηλεκτρονικά > Τηλεοράσεις (Home > Electronics > Televisions)
- ...
#3 Data(2)
•Training Dataset - Raw Features
- Image
- Text
- Numerical
- Categorical
- Label
#3 Data(3)
•Preprocessing
- Text
- Numerical
- Categorical
- Labels
•Output: X (features) and y (labels)
#3 Preprocessing - Text
• Our best solution involves “Word Vectors”
• Steps to prepare for word vectors:
- Learn a word Vocabulary (mapping of words to numeric ids)
- Transform text sentences to Sequences of ids based on Vocabulary
- Decide a representative sequence length (E.g. 60 words)
- Apply zero padding (pre or post) and truncation to maintain a fixed length
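The four steps above can be sketched in plain Python. This is a simplified stand-in: whitespace tokenization, dropping unknown words and post-padding are assumptions, and a real setup would more likely use a library tokenizer.

```python
def build_vocabulary(texts):
    """Map each word to an integer id; id 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def to_padded_sequence(text, vocab, max_len):
    """Turn text into a fixed-length list of ids (unknown words are dropped)."""
    ids = [vocab[w] for w in text.lower().split() if w in vocab]
    ids = ids[:max_len]                      # truncate (post)
    return ids + [0] * (max_len - len(ids))  # zero-pad (post)


vocab = build_vocabulary(["Samsung Galaxy S9", "Samsung TV 32"])
seq = to_padded_sequence("Samsung TV unknown", vocab, max_len=5)
print(vocab)
print(seq)
```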
#3 Preprocessing - Text(2)
• Use of Pretrained Embeddings (see Word2Vec, FastText, GloVe etc.)
• We use FastText library with skipgram algorithm (unsupervised)
- https://fasttext.cc/docs/en/unsupervised-tutorial.html
#3 Preprocessing - Text(3)
• Embeddings:
- Output: 100-dimensional vector per word
- Total 1,500,000 rows (vocab)
• 2 versions (Name, Shop-category)
#3 Preprocessing - Numerical
• “Pricevat” and “Name Length” values
• Apply Standard Scaling
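Standard scaling subtracts the mean and divides by the standard deviation, so each numerical feature ends up with mean 0 and unit variance. A minimal sketch with made-up prices; it uses the population standard deviation, which is also what scikit-learn's `StandardScaler` does.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Return values shifted to mean 0 and scaled to unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no signal
    return [(v - mu) / sigma for v in values]


prices = [100.0, 200.0, 300.0]  # illustrative values only
scaled = standard_scale(prices)
print(scaled)
```

In practice the mean and deviation are learned on the training split and reused unchanged at inference time.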
#3 Preprocessing - Categorical
• All discrete value attributes/features:
- shop_id
- matching Product PNs category_id list
• One-Hot encoding
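A small sketch of both encodings. Treating the list of matching PN category ids as a "multi-hot" vector (several positions set to 1) is an assumption about how a list-valued feature could be one-hot encoded; the ids are made up.

```python
def one_hot(value, known_values):
    """Single categorical value -> vector with exactly one 1 (or all 0 if unseen)."""
    return [1 if value == known else 0 for known in known_values]


def multi_hot(values, known_values):
    """List of categorical values -> vector with a 1 for each value present."""
    present = set(values)
    return [1 if known in present else 0 for known in known_values]


shop_ids = [7, 19, 42]             # hypothetical shop ids
category_ids = [10, 20, 30, 40]    # hypothetical category ids

print(one_hot(19, shop_ids))
print(multi_hot([20, 40], category_ids))
```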
#3 Label Encoding
• “category_id” values are the “true” labels which should be learned by the NN
• One-Hot encoding
• OR just use IDs and rely on Keras conventions (e.g. use an internal sparse categorical
representation to save huge amounts of RAM)
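The RAM argument is easy to quantify. The sketch below compares a dense one-hot label matrix with plain integer ids, assuming 4 bytes per value and a hypothetical dataset of one million products (the dataset size is not given in the slides). Keras supports the integer form through its `sparse_categorical_crossentropy` loss.

```python
NUM_CLASSES = 2_300       # leaf categories, from the slides
NUM_SAMPLES = 1_000_000   # hypothetical dataset size
BYTES_PER_VALUE = 4       # float32 / int32

one_hot_bytes = NUM_SAMPLES * NUM_CLASSES * BYTES_PER_VALUE
sparse_bytes = NUM_SAMPLES * BYTES_PER_VALUE

print(f"one-hot labels: {one_hot_bytes / 1e9:.1f} GB")
print(f"integer labels: {sparse_bytes / 1e6:.1f} MB")
print(f"ratio: {one_hot_bytes // sparse_bytes}x")
```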
#4 Training
1.Basic Concepts
2.Model Architecture
3.Training “In Action”
#4.1 Basic Concepts
•Objective:
- Find a combination of mathematical functions and a set of
corresponding params to maximize prediction accuracy (or minimize
error rate).
- Ensure that the above generalizes well for production.
- Learn params in an acceptable time window.
•Experiment with Neural Network architectures
•GPUs to the rescue (speedup x10)
#4.1 Basic Concepts(2)
•Loss function
- Categorical Crossentropy
•Optimizer
- Adam (a gradient descent variant)
•Hyper-params
- Mini-Batch Size
- Learning Rate
- Epochs
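For a single example, categorical cross-entropy reduces to the negative log of the probability assigned to the true class. A minimal sketch with made-up logits for three classes:

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(probs, true_index):
    """Loss for one example: -log(probability of the true class)."""
    return -math.log(probs[true_index])


probs = softmax([2.0, 1.0, 0.1])  # illustrative logits
loss_if_class_0 = categorical_crossentropy(probs, 0)
loss_if_class_2 = categorical_crossentropy(probs, 2)
print(probs, loss_if_class_0, loss_if_class_2)
```

The loss is small when the model puts most of its probability on the true class and grows quickly when it does not, which is what the optimizer pushes against.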
#4.1 Validation
•Why?
- Simulate unseen data
- Compare different:
✓ training methods
✓ hyper-params
- Avoid Overfitting
•Should be representative
•Validation Strategy
- 10% of whole Dataset
- Stratification on Categories
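Stratification means the 10% is drawn per category, so every category keeps roughly the same share in both splits. A plain-Python sketch with made-up labels; forcing at least one validation sample per category is a design choice of this sketch, not something stated in the slides.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.1, seed=0):
    """Return (train_indices, val_indices), sampling each category separately."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, val = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = max(1, round(len(indices) * val_fraction))
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)


labels = ["tv"] * 50 + ["phone"] * 30 + ["cable"] * 20
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

Note that with this rule a category with a single sample would end up only in the validation set, so very rare categories need separate handling.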
#4.2 Model Architecture
•Hybrid End-to-End architecture
•4 branches (4 input vectors):
A. Name Features Branch
B. Shop-Category Features Branch
C. Basic Features Branch (Numerics, Categorical)
D. Matching PNs Branch (Categorical)
#4.2 Text Branches
• Inspired by “Embed, Encode, Attend, Predict”
- https://explosion.ai/blog/deep-learning-formula-nlp
• Each of the “name” and “shop-category” sequences flows through:
- 1 x Embeddings Layer
- 1 x Bi-LSTM Encoder
- 1 x Attention Module
- 1 x LSTM Encoder
#4.2 Text Branches - Why LSTM?
• LSTM stands for “Long Short Term Memory” Layer (Encoder):
- Memory Cells / Captures context
- Propagates signal from previous words to the next in a Sequence
- 2 Stacked Layers performed better in our experiments
- 128 dimension output vector
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Number of cells = sequence length
#4.2 Text Branches - Why pay Attention?
• Attention Mechanism:
- Controls how much signal should be propagated to the next layers
- https://distill.pub/2016/augmented-rnns/
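One common way to realize this idea is to turn a score per sequence position into softmax weights and scale each encoder state by its weight. The sketch below assumes that form; in a real model the scores are produced by learned parameters, whereas here they are fixed, made-up numbers, and the slides do not specify the exact attention variant used.

```python
import math


def attention_reweight(states, scores):
    """Scale each encoder state by its softmax attention weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    reweighted = [[w * x for x in state] for w, state in zip(weights, states)]
    return weights, reweighted


# Three sequence positions with 2-dimensional states (illustrative values).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.1, 0.1, 3.0]  # hypothetical; the third word "matters most"

weights, reweighted = attention_reweight(states, scores)
print(weights)
print(reweighted)
```

Positions with low weights contribute almost nothing to what the next layer sees.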
#4.2 Other Branches
• Basic Features Branch
- Inputs a concatenation of basic feats
- 1 Dense layer with #classes output
- ReLU activation
• Matching PNs Branch
- Inputs a concatenation of PN feats
- Short-circuited to final layer
(Figure: input vector mapped to #classes outputs, ~2k for Skroutz)
#4.2 Final Layer
• Merging Layer
- Concatenates all 4 branches outputs
- softmax activation
- Output: probabilities for each class
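A toy version of the merging step: concatenate the four branch outputs, apply a linear layer and a softmax. The branch sizes, the three classes and the random weights are all placeholders, and the presence of a linear layer before the softmax is an assumption of this sketch.

```python
import math
import random


def merge_and_classify(branches, weights, biases):
    """Concatenate branch outputs, apply a linear layer, then softmax."""
    merged = [x for branch in branches for x in branch]  # concatenation
    logits = [
        sum(w * x for w, x in zip(row, merged)) + b
        for row, b in zip(weights, biases)
    ]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Made-up outputs of: name, shop-category, basic features, matching PNs.
branches = [[0.2, 0.7], [0.1, 0.4], [0.9], [0.0, 1.0, 0.0]]
n_inputs = sum(len(b) for b in branches)
n_classes = 3  # ~2k in the real model
weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_classes)]
biases = [0.0] * n_classes

probs = merge_and_classify(branches, weights, biases)
print(probs)
```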
#4.2 Model Architecture
• Model Capacity/Complexity (figure not reproduced)
#4.3 Training In Action - Model Selection
•Conducted hundreds of experiments with different combinations
of features, layers, modules (e.g. Embeddings, Bag of Words,
TF/IDF, LSTM, etc.)
•Tens of ablation studies: remove specific features to see how
performance is affected
•Read many papers and applied some common tricks (Bi-LSTM,
AdaptivePooling etc.)
•It is alchemy!
#4.3 Training In Action - Tools
•Training Scheduler Process runs weekly
•CLI training commands
- CUDA_VISIBLE_DEVICES=1 python -m category_classifier.cli scrooge --model end2end --train --epochs 8 --batch_size 128
•Model Versioning
- E.g. “skroutz_models_2018_09_01_v1.tar.gz”
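For readers unfamiliar with such CLIs, here is how a parser accepting the flags shown in the command above could be declared with `argparse`. This is a guess at the structure, not the actual `category_classifier.cli` code; in particular, treating `scrooge` as a positional "flavor" argument and the default values are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags in the training command."""
    parser = argparse.ArgumentParser(prog="category_classifier.cli")
    parser.add_argument("flavor", help="dataset/site flavor, e.g. 'scrooge'")
    parser.add_argument("--model", default="end2end", help="model architecture name")
    parser.add_argument("--train", action="store_true", help="run a training session")
    parser.add_argument("--epochs", type=int, default=8)
    parser.add_argument("--batch_size", type=int, default=128)
    return parser


args = build_parser().parse_args(
    ["scrooge", "--model", "end2end", "--train", "--epochs", "8", "--batch_size", "128"]
)
print(args)
```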
#4.3 Training In Action
(Screenshots not reproduced: training run output example, GPU monitoring)
#4.3 Training In Action
(Plots not reproduced: learning curves in TensorBoard, current best vs. previous architecture)
#5 Inference
1.Inference Pipeline
2.Inference API
3.Production
#5.1 Inference Pipeline
•Online execution:
- preprocessing
- vectorization
- Prediction
•Utilized by CategoryClassifier Class
- Wrapper of external API
•Utilize scikit-learn Pipelines
- http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
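The chaining idea behind the inference pipeline can be shown with a tiny stand-in class. It only mirrors the "ordered list of named steps" concept of `sklearn.pipeline.Pipeline`, not its API, and the three step functions are deliberately trivial placeholders rather than the real preprocessing or model.

```python
class MiniPipeline:
    """Run named steps in order, feeding each output into the next step."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, callable) pairs

    def run(self, data):
        for _, step in self.steps:
            data = step(data)
        return data


def preprocess(product):
    return product["name"].lower().split()


def vectorize(tokens):
    return [len(tokens), sum(len(t) for t in tokens)]


def predict(vector):
    # Placeholder "model": a real one would return category scores.
    return "long-title" if vector[0] > 3 else "short-title"


pipeline = MiniPipeline([
    ("preprocessing", preprocess),
    ("vectorization", vectorize),
    ("prediction", predict),
])
result = pipeline.run({"name": "Samsung TV 32 Full HD"})
print(result)
```

Bundling the steps this way keeps training-time and inference-time preprocessing identical, which is the main reason to use a pipeline object at all.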
#5.2 Inference API
• REPL
• Kafka Worker
• Flask App
#5.3 Production
•x2 inference VMs
- inference1.skroutz.gr, inference2.skroutz.gr (Kafka Workers)
•x2 Flavors (Greece, UK)
•Grafana Monitoring for Kafka Part
#6 Evaluation
•More than 6% error rate reduction overall in Skroutz!
•Currently, roughly 2 content-editor hours saved per day in
Skroutz (and this scales)!
•Move operations from the list of “uncategorized” products
reduced significantly (by an order of magnitude)!
#6 Performance Summary
                                 Success Rate        Failure Rate        No Prediction Rate
                                 Megatron  Old       Megatron  Old       Megatron  Old
Skroutz (GR), 2.3k categories    90.10%    82.6%     7.9%      13.8%     2%        3.5%
  (row label not given)          91.85%    85.7%     8.14%     14.32%    N/A       N/A
Scrooge (UK), 350 categories     87.56%    38.9%     2.5%      26.24%    9.9%      58.48%
  (row label not given)          97.1%     93.67%    2.8%      6.32%     N/A       N/A
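The slide does not label the second row of figures for each site. One plausible reading, offered here only as an assumption, is that it gives rates computed over products that actually received a prediction. The check below recomputes the success rates on that basis and compares them with the reported values.

```python
def rate_over_predicted(rate, no_prediction_rate):
    """Re-express a percentage relative to products that got a prediction."""
    return 100.0 * rate / (100.0 - no_prediction_rate)


# (success %, no-prediction %, reported second-row success %), from the table.
rows = {
    "Skroutz / Megatron": (90.10, 2.0, 91.85),
    "Skroutz / Old": (82.6, 3.5, 85.7),
    "Scrooge / Megatron": (87.56, 9.9, 97.1),
    "Scrooge / Old": (38.9, 58.48, 93.67),
}

differences = {}
for name, (success, no_pred, reported) in rows.items():
    derived = rate_over_predicted(success, no_pred)
    differences[name] = abs(derived - reported)
    print(f"{name}: derived {derived:.2f}% vs reported {reported}%")
```

The derived values land within about 0.1 percentage points of the reported ones, which is consistent with that reading but does not prove it. Separately, the three "Old" rates printed for Scrooge add up to well over 100%, so one of those figures may be a transcription slip in the slide.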
#6 Monitoring Dashboard
#Future Improvements
• Utilize Image Features (in End-To-End model)
• Utilize Entity Recognition to extract more features
• Find ways to utilize more features (color, sizes etc.)
• Categorical Self-Trained Embeddings
• Experiment with newer solutions like “Transformer”
#Contact Info
Andreas Loupasakis
• Email: alup@skroutz.gr
• Kaggle: https://www.kaggle.com/andreaslup
• Twitter: https://twitter.com/andy_lupo
• LinkedIn: https://www.linkedin.com/in/andreas-loupasakis-06399a47
Thank you!

Automated product categorization

  • 1.
  • 2. Chapters 1. Introduction 2. Service Architecture 3. Data 4. Training 5. Inference 6. Evaluation
  • 3. #1 What is Skroutz.gr? • Skroutz.gr is a marketplace & shopping assistant which makes online shopping easier and more reliable • It includes more than 11,000,000 products from 3,200 different e-shops • On a monthly basis the website welcomes more than 8 million unique visitors ranking in the top positions in the Greek Web
  • 4. #1 Some Numbers 3,200 merchants 11 million products 270 mil. pageviews / mo 1.1 mil. searches/day 33 mil. sessions/mo
  • 5. #1 The Problem • Each day we collect thousands of new products by downloading e-shop feeds (XML, CSV etc. - product catalogs) • We want to categorize incoming product payloads as provided by eshops to the most relevant categories in Skroutz category tree taxonomy with the minimum human intervention. - Difficult - Important
  • 6. #1 Why Difficult? • Many leaf categories in Skroutz taxonomy (>2k) • Sibling categories (subjective categorization) • Misleading product titles and shop-categories from shops
  • 7. #1 Why Important? Robot MO collects products from shop feeds and stores them to DB Megatron category classifier categorizes products to the correct category Tron groups similar products to entities called SKUs to be ready for indexing Elasticsearch indexes products to be searchable from user interface
  • 8. #1 Facts •Merchants send more than ~15k new products every day in Skroutz!!! •2.3k unique leaf categories in our category tree (taxonomy) •Manual “move-to-category” action: - Costs ~7.8s on average for content managers - Subjective decisions may add extra overhead
  • 9. #1 Old Solution - Overview •Use Elasticseach to match specific product attributes: - PN (manufacturer part number) - Name - Shop-category •Aggregate matches and group by categories •Normalize results and use custom weights to calculate a score •Take Top-K results
  • 10. #1 Old Solution - Limitations •Plain cosine similarity distance on TF/IDF weights: - No learning feedback loop - No advanced statistics utilization (e.g. correlation between price value and text features) •No easy way to tune custom weights applied on final scoring •Heuristics don’t take into account category specific context •Heuristics don’t take into account word level context. E.g. word “samsung” is followed by word “galaxy” most of the time and then probably follows a model number.
  • 11. #1 Old Solution - Good Parts •Simple solution (except for custom scoring stuff) •Easy to debug •Easy to deploy •Online
  • 12. #1 New Solution - “Megatron”
  • 13. #1 Overview •Approach problem as a supervised learning task •Rely on probabilities to obtain a meaningful score •Use more features from multiple sources and use datasets •Learn new patterns and relations by training •Measure performance on dataset splits •Use a microservice to serve classification requests •Apply threshold for low confidence results
  • 14. Chapters 1. Introduction 2. Service Architecture 3. Data 4. Training 5. Inference 6. Evaluation
  • 15. #2 Service Architecture 1.Training Phase 2.Inference Phase 3.APIs
  • 16.
  • 17. #2.1 Training Phase 1. Export dataset (product features labeled with category_id) and upload to Swift 2. Download specific dataset version in “training VM” 3. Start a training session using a train/val split from dataset 4. Save best performing model params snapshot (based on validation set loss) 5. Compress and upload model params to Swift container
  • 18. #2.2 Inference Phase 1. Application Part: Send classification request upon new product arrivals: - Kafka producer (asynchronous request) - Megatron Client HTTP synchronous requests (2nd alternative) 2. Category Classifier Microservice Part: - Pop messages from stream (Kafka consumer) - Dispatch messages to in-memory Neural Network instance - Fetch predictions (scores) and post-back to Core Application API endpoint
  • 19. #2.3 APIs 1. Megatron microservice internal API - Common API (wraps Keras API) - Basic methods: ✓ build ✓ train ✓ save ✓ load ✓ predict - CLI commands
  • 20. #2.3 APIs(2) 1. Skroutz Application Ecosystem (Ruby client) - Megatron::Client ✓ Issues requests to microservice - Megatron DB model ✓ Stores prediction results - ApiController endpoint ✓ Receives callbacks from microservice
  • 21. Chapters 1. Introduction 2. Service Architecture 3. Data 4. Training 5. Inference 6. Evaluation
  • 22. #3 Data •Product attribute values (potential features) Product Name Shop manufacturer Part number EAN Price Shop category ... Samsung TV 32'' DF324 (PNDFD22) Full HD Black NEW Αρχική > Ηλεκτρονικά > Τηλεοράσεις PNDFD22 300 €
  • 23. #3 Data(2) •Training Dataset - Raw Features Image Numerical Categorical Label Text
  • 24. #3 Data(3) •Preprocessing - Text - Numerical - Categorical - Labels X y
  • 25. #3 Preprocessing - Text • Our best solution involves “Word Vectors” • Steps to prepare for word vectors: - Learn a words Vocabulary (mapping of words to numeric id) - Transform text sentences to Sequences of ids based on Vocabulary - Decide a representative sequence length (E.g. 60 words) - Apply zero padding (pre or post) and truncation to maintain a fixed length
  • 26. #3 Preprocessing - Text(2) • Use of Pretrained Embeddings (see W2Vec, FastText, GloVe etc.) • We use FastText library with skipgram algorithm (unsupervised) - https://fasttext.cc/docs/en/unsupervised-tutorial.html
  • 27. #3 Preprocessing - Text(3) • Embeddings: - Outputs 100 dim Vector - Total 1,500,000 rows (vocab) • 2 versions (Name, Shop-category)
  • 28. #3 Preprocessing - Numerical • “Pricevat” and “Name Length” values • Apply Standard Scaling
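Standard scaling subtracts the mean and divides by the standard deviation, so each numerical feature ends up with zero mean and unit variance. A minimal sketch, with made-up prices and hypothetical helper names:

```python
from statistics import fmean, pstdev


def fit_standard_scaler(values):
    """Learn mean and (population) standard deviation from training data."""
    mean = fmean(values)
    std = pstdev(values) or 1.0  # avoid division by zero on constant columns
    return mean, std


def transform(values, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(v - mean) / std for v in values]


prices = [300.0, 450.0, 150.0, 900.0]  # illustrative values only
mean, std = fit_standard_scaler(prices)
scaled = transform(prices, mean, std)
```

The mean and standard deviation are learned on the training set and then reused unchanged at inference time.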
  • 29. #3 Preprocessing - Categorical • All discrete value attributes/features: - shop_id - matching Product PNs category_id list • One-Hot encoding:
  • 30. #3 Label Encoding • “category_id” values are the “true” labels which should be learned by the NN • One-Hot encoding • OR just use IDs and rely on Keras conventions (e.g. use an internal sparse categorical representation to save huge amounts of RAM)
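The difference between the two label encodings is easy to see on a toy example. With roughly 2k classes, a one-hot row holds about 2k numbers per product, while the sparse form holds a single integer; Keras accepts such integer labels through its `sparse_categorical_crossentropy` loss. The category ids and helper names below are made up for illustration.

```python
def fit_label_index(category_ids):
    """Map raw category ids to contiguous class indices 0..n-1."""
    return {cid: i for i, cid in enumerate(sorted(set(category_ids)))}


def one_hot(index, num_classes):
    """Dense encoding: a row of zeros with a single 1."""
    row = [0] * num_classes
    row[index] = 1
    return row


category_ids = [40, 12, 40, 1105]  # illustrative ids only
label_index = fit_label_index(category_ids)

sparse_labels = [label_index[c] for c in category_ids]
dense_labels = [one_hot(i, len(label_index)) for i in sparse_labels]
```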
  • 32. #4 Training 1.Basic Concepts 2.Model Architecture 3.Training “In Action”
  • 33. #4.1 Basic Concepts •Objective: - Find a combination of mathematical functions and a set of corresponding params to maximize prediction accuracy (or minimize error rate). - Ensure that the above generalizes well for production. - Learn params in an acceptable time window. •Experiment with Neural Network architectures •GPUs to the rescue (10x speedup)
  • 34. #4.1 Basic Concepts(2) •Loss function - Categorical Crossentropy •Optimizer - Adam (Gradient Descent) •Hyper-params - Mini-Batch Size - Learning Rate - Epochs
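For a single example, categorical cross-entropy reduces to the negative log of the probability the model assigns to the true class, so confident correct predictions cost little and confident wrong ones cost a lot. A dependency-free sketch with made-up logits:

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(probs, true_index, eps=1e-12):
    """Loss for one example: -log(p[true class])."""
    return -math.log(max(probs[true_index], eps))


probs = softmax([2.0, 1.0, 0.1])  # illustrative logits for 3 classes
loss_correct = categorical_crossentropy(probs, 0)  # true class is the top one
loss_wrong = categorical_crossentropy(probs, 2)    # true class is the lowest
```

The optimizer (Adam in the slides) adjusts the parameters to lower the average of this loss over each mini-batch.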
  • 35. #4.1 Validation •Why? - Simulate unseen data - Compare different: ✓ training methods ✓ hyper-params - Avoid Overfitting •Should be representative •Validation Strategy - 10% of whole Dataset - Stratification on Categories
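A stratified 10% split holds out the same share of every category, so the validation set mirrors the class distribution of the full dataset. The function below is a minimal sketch under that reading; the name `stratified_split`, the seed and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.1, seed=0):
    """Return (train_indices, val_indices), holding out a share per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = ["tv"] * 50 + ["phone"] * 30 + ["cable"] * 20  # toy dataset
train_idx, val_idx = stratified_split(labels)
```

Note that with this rounding rule, categories with very few examples contribute nothing to the validation set; how such rare categories were handled is not stated in the slides.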
  • 37. #4.2 Model Architecture •Hybrid End-to-End architecture •4 branches (4 input vectors): A. Name Features Branch (Text) B. Shop-Category Features Branch (Text) C. Basic Features Branch (Numerics, Categorical) D. Matching PNs Branch (Categorical)
  • 38. #4.2 Text Branches • Inspired by “Embed, Encode, Attend, Predict” - https://explosion.ai/blog/deep-learning-formula-nlp • Each of “name” and “shop-category” sequence flows through: - 1 x Embeddings Layer - 1 x Bi-LSTM Encoder - 1 x Attention Module - 1 x LSTM Encoder
  • 39. #4.2 Text Branches - Why LSTM? • LSTM stands for “Long Short-Term Memory” Layer (Encoder): - Memory Cells / Captures context - Propagates signal from previous words to the next in a Sequence - 2 Stacked Layers performed better in our experiments - 128-dimension output vector - https://colah.github.io/posts/2015-08-Understanding-LSTMs/ (Figure labels: 128-dim output; #cells = sequence length)
  • 40. #4.2 Text Branches - Why pay Attention? • Attention Mechanism: - Controls how much signal should be propagated to the next layers - https://distill.pub/2016/augmented-rnns/
  • 41. #4.2 Other Branches • Basic Features Branch - Inputs a concatenation of basic feats - 1 Dense layer with #classes output - ReLU activation • Matching PNs Branch - Inputs a concatenation of PN feats - Short-circuited to final layer (Figure labels: Input Vector; #classes, ~2k for Skroutz)
  • 42. #4.2 Final Layer • Merging Layer - Concatenates the outputs of all 4 branches - softmax activation - Output: probabilities for each class
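One common way to realize this idea is attention pooling: each encoder output gets a score, the scores are normalized with a softmax, and the outputs are combined as a weighted sum. The slides do not specify which attention variant is used, so the dot-product scoring against a context vector below is an assumption, and all values are made up.

```python
import math


def attention_pool(hidden_states, context):
    """Weight each timestep by softmax(h . context) and sum the states."""
    scores = [sum(h_d * c_d for h_d, c_d in zip(h, context)) for h in hidden_states]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(hidden_states[0])
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, pooled


# Three timesteps with 2-dimensional encoder outputs (illustrative values).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
context = [1.0, 0.0]  # in a real model this vector is learned
weights, pooled = attention_pool(hidden_states, context)
```

Timesteps that align with the context vector receive larger weights, so their signal dominates what is passed on.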
  • 44. #4.2 Model Architecture • Model Capacity/Complexity:
  • 45. #4.3 Training In Action - Model Selection •Conducted hundreds of experiments with different combinations of features, layers, modules (e.g. Embeddings, Bag of Words, TF/IDF, LSTM, etc.) •Tens of ablation studies: remove specific features to see how performance is affected •Read many papers and applied some common tricks (Bi-LSTM, AdaptivePooling etc.) •It is alchemy!
  • 46. #4.3 Training In Action - Tools •Training Scheduler Process runs weekly •CLI training commands - CUDA_VISIBLE_DEVICES=1 python -m category_classifier.cli scrooge --model end2end --train --epochs 8 --batch_size 128 •Model Versioning - E.g. “skroutz_models_2018_09_01_v1.tar.gz”
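A date-stamped archive name like the example can be produced by a small helper. The naming scheme is inferred from the single example on the slide, so the function and its parameters are hypothetical.

```python
import datetime


def model_archive_name(flavor, trained_on, version):
    """Build '<flavor>_models_<YYYY_MM_DD>_v<version>.tar.gz'."""
    return f"{flavor}_models_{trained_on:%Y_%m_%d}_v{version}.tar.gz"


name = model_archive_name("skroutz", datetime.date(2018, 9, 1), 1)
```

Putting the date first in the variable part keeps archives sortable by name, which is convenient for a weekly training schedule.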
  • 47. #4.3 Training In Action Training run output example: GPU monitoring:
  • 48. #4.3 Training In Action Learning Curves (TensorBoard), comparing “Current best” vs. “Previous Arch”
  • 51. #5.1 Inference Pipeline •Online execution: - preprocessing - vectorization - Prediction •Utilized by CategoryClassifier Class - Wrapper of external API •Utilize scikit-learn Pipelines - http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
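The three online steps form a chain in which each step consumes the output of the previous one, which is the idea behind scikit-learn's `Pipeline`. The sketch below is a dependency-free stand-in rather than scikit-learn itself; `InferencePipeline`, the step functions, the tiny vocabulary and the placeholder "model" are all hypothetical.

```python
class InferencePipeline:
    """Run named steps in order, feeding each output into the next step."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, callable) pairs

    def run(self, product):
        value = product
        for _name, step in self.steps:
            value = step(value)
        return value


def preprocess(product):
    """Lowercase and tokenize the product name."""
    return product["name"].lower().split()


def make_vectorizer(vocab):
    """Return a step that maps tokens to ids (0 for unknown words)."""
    def vectorize(tokens):
        return [vocab.get(t, 0) for t in tokens]
    return vectorize


def predict(ids):
    """Placeholder for the neural network's prediction."""
    return "tv" if 2 in ids else "unknown"


pipeline = InferencePipeline([
    ("preprocessing", preprocess),
    ("vectorization", make_vectorizer({"samsung": 1, "tv": 2})),
    ("prediction", predict),
])
category = pipeline.run({"name": "Samsung TV"})
```

Bundling the steps this way ensures inference applies exactly the same transformations, in the same order, as training did.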
  • 52. #5.2 Inference API • REPL • Kafka Worker • Flask App
  • 53. #5.3 Production •x2 inference VMs - inference1.skroutz.gr, inference2.skroutz.gr (Kafka Workers) •x2 Flavors (Greece, UK) •Grafana Monitoring for Kafka Part
  • 55. #6 Evaluation •More than 6% error rate reduction overall in Skroutz! •Currently, roughly 2 or more content-editor hours saved per day in Skroutz (this is scaling)! •Move operations from the list of “uncategorized” products reduced significantly (by an order of magnitude)!
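  • 56. #6 Performance Summary (each cell: Megatron vs. Old)
      Market | Success Rate | Failure Rate | No Prediction Rate
      Skroutz (GR), 2.3k categories | 90.10% vs. 82.6% | 7.9% vs. 13.8% | 2% vs. 3.5%
      Skroutz (GR), second row (unlabeled) | 91.85% vs. 85.7% | 8.14% vs. 14.32% | N/A vs. N/A
      Scrooge (UK), 350 categories | 87.56% vs. 38.9% | 2.5% vs. 26.24% | 9.9% vs. 58.48%
      Scrooge (UK), second row (unlabeled) | 97.1% vs. 93.67% | 2.8% vs. 6.32% | N/A vs. N/A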
  • 58. #Future Improvements • Utilize Image Features (in End-To-End model) • Utilize Entity Recognition to extract more features • Find ways to utilize more features (color, sizes etc.) • Categorical Self-Trained Embeddings • Experiment with newer solutions like “Transformer”
  • 59. #Contact Info Andreas Loupasakis • Email: alup@skroutz.gr • Kaggle: https://www.kaggle.com/andreaslup • Twitter: https://twitter.com/andy_lupo • LinkedIn: https://www.linkedin.com/in/andreas-loupasakis-06399a47