Chapters
1. Introduction
2. Service Architecture
3. Data
4. Training
5. Inference
6. Evaluation
#1 What is Skroutz.gr?
• Skroutz.gr is a marketplace and shopping assistant that
makes online shopping easier and more reliable
• It lists more than 11,000,000 products from 3,200
different e-shops
• Each month the website welcomes more than 8
million unique visitors, ranking among the top sites on the
Greek Web
#1 Some Numbers
• 3,200 merchants
• 11 million products
• 270 million pageviews/month
• 1.1 million searches/day
• 33 million sessions/month
#1 The Problem
• Each day we collect thousands of new products by
downloading e-shop feeds (XML, CSV etc. - product
catalogs)
• We want to assign incoming product payloads, as
provided by e-shops, to the most relevant categories in the
Skroutz category tree (taxonomy) with minimal human
intervention.
- Difficult
- Important
#1 Why Difficult?
• Many leaf categories in
Skroutz taxonomy (>2k)
• Sibling categories
(subjective categorization)
• Misleading product titles
and shop-categories from
shops
#1 Why Important?
1. Robot MO collects products from shop feeds and stores them in the DB
2. Megatron category classifier assigns products to the correct category
3. Tron groups similar products into entities called SKUs, ready for indexing
4. Elasticsearch indexes products so they are searchable from the user interface
#1 Facts
•Merchants send more than 15k new products to Skroutz every day
•2.3k unique leaf categories in our category tree (taxonomy)
•Manual “move-to-category” action:
- Costs ~7.8s on average for content managers
- Subjective decisions may add extra overhead
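A back-of-the-envelope sketch of what these figures imply, assuming (as an upper bound that the slides do not state) that every new product would need one manual “move-to-category” action:

```python
# Figures quoted on the slide.
NEW_PRODUCTS_PER_DAY = 15_000
SECONDS_PER_MANUAL_MOVE = 7.8


def manual_hours_per_day(products: int, seconds_each: float) -> float:
    """Total content-manager time per day, in hours."""
    return products * seconds_each / 3600


# Assumption: every incoming product is categorized by hand.
hours = manual_hours_per_day(NEW_PRODUCTS_PER_DAY, SECONDS_PER_MANUAL_MOVE)
print(f"{hours:.1f} hours/day")
```

Under that assumption the manual effort is 15,000 x 7.8 s = 117,000 s per day.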
#1 Old Solution - Overview
•Use Elasticsearch to match specific product attributes:
- PN (manufacturer part number)
- Name
- Shop-category
•Aggregate matches and group by categories
•Normalize results and use custom weights to calculate a score
•Take Top-K results
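The aggregation steps above could look roughly like the sketch below. It is an illustration only: the weights, the per-attribute max normalization and the function name are hypothetical, since the slides do not give the actual scoring formula.

```python
from collections import defaultdict

# Hypothetical per-attribute weights; the real values are not in the slides.
WEIGHTS = {"pn": 3.0, "name": 1.5, "shop_category": 1.0}


def top_k_categories(matches, k=3):
    """Rank categories from search matches.

    matches: list of (attribute, category_id, similarity) tuples, one per
    matched product. Similarities are normalized per attribute (divided by
    the attribute's best similarity), weighted, and summed per category.
    """
    best_per_attr = defaultdict(float)
    for attr, _, sim in matches:
        best_per_attr[attr] = max(best_per_attr[attr], sim)

    scores = defaultdict(float)
    for attr, category_id, sim in matches:
        if best_per_attr[attr] > 0:
            scores[category_id] += WEIGHTS[attr] * sim / best_per_attr[attr]

    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


matches = [
    ("pn", 40, 2.0),
    ("name", 40, 4.0),
    ("name", 12, 8.0),
    ("shop_category", 12, 5.0),
]
ranking = top_k_categories(matches, k=2)
```

The hand-picked `WEIGHTS` table is exactly the kind of parameter that is hard to tune without a learning feedback loop.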
#1 Old Solution - Limitations
•Plain cosine similarity on TF-IDF weights:
- No learning feedback loop
- No advanced statistics utilization (e.g. correlation between price
value and text features)
•No easy way to tune the custom weights applied in the final scoring
•Heuristics don’t take category-specific context into account
•Heuristics don’t take word-level context into account. E.g. the
word “samsung” is usually followed by the word “galaxy” and
then, most likely, by a model number.
#1 Old Solution - Good Parts
•Simple solution (apart from the custom scoring)
•Easy to debug
•Easy to deploy
•Online
#1 New Solution - “Megatron”
#1 Overview
•Approach problem as a supervised learning task
•Rely on probabilities to obtain a meaningful score
•Use more features from multiple sources, assembled into datasets
•Learn new patterns and relations by training
•Measure performance on dataset splits
•Use a microservice to serve classification requests
•Apply threshold for low confidence results
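A minimal sketch of the thresholding idea in the last bullet, assuming the classifier returns one probability per category. The threshold value and the function name are hypothetical; the slides do not state the value used in production.

```python
def classify_with_threshold(probabilities, threshold=0.5):
    """Return (category_id, probability) for the most likely category,
    or None when the classifier is not confident enough.

    probabilities: dict mapping category_id -> probability.
    threshold: hypothetical cut-off, not taken from the slides.
    """
    if not probabilities:
        return None
    category_id, p = max(probabilities.items(), key=lambda kv: kv[1])
    return (category_id, p) if p >= threshold else None


confident = classify_with_threshold({12: 0.91, 40: 0.06, 7: 0.03})
unsure = classify_with_threshold({12: 0.40, 40: 0.35, 7: 0.25})
```

Here `confident` is `(12, 0.91)` and `unsure` is `None`, which a caller could treat as “no prediction” and leave for manual review.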
#2 Service Architecture
1.Training Phase
2.Inference Phase
3.APIs
#2.1 Training Phase
1. Export dataset (product features labeled with category_id) and upload to Swift
2. Download specific dataset version in “training VM”
3. Start a training session using a train/val split from dataset
4. Save best performing model params snapshot (based on validation set loss)
5. Compress and upload model params to Swift container
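Step 4 (keep the snapshot with the best validation loss) can be sketched framework-free as below. The function and callback names are hypothetical. In Keras the same behaviour is available through the `ModelCheckpoint` callback with `save_best_only=True`; whether the service uses that callback is not stated in the slides.

```python
import math


def train_with_best_snapshot(epochs, run_epoch, save_snapshot):
    """Run `epochs` training epochs and snapshot whenever validation loss improves.

    run_epoch(epoch) -> validation loss for that epoch (hypothetical callback).
    save_snapshot(epoch) -> persists the current params (hypothetical callback).
    Returns (best_epoch, best_loss).
    """
    best_loss = math.inf
    best_epoch = None
    for epoch in range(epochs):
        val_loss = run_epoch(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            save_snapshot(epoch)
    return best_epoch, best_loss


# Toy run: validation losses are made up for illustration.
losses = [0.9, 0.6, 0.7, 0.5, 0.55]
saved_epochs = []
best = train_with_best_snapshot(len(losses), lambda e: losses[e], saved_epochs.append)
```

In this toy run snapshots are written at epochs 0, 1 and 3, and the last one written is the one that would be compressed and uploaded.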
#2.2 Inference Phase
1. Application Part: Send a classification request when new products arrive:
- Kafka producer (asynchronous request)
- Megatron Client synchronous HTTP requests (second alternative)
2. Category Classifier Microservice Part:
- Pop messages from the stream (Kafka consumer)
- Dispatch messages to the in-memory Neural Network instance
- Fetch predictions (scores) and post them back to the Core Application API endpoint
#2.3 APIs
1. Megatron microservice internal API
- Common API (wraps Keras API)
- Basic methods:
✓ build
✓ train
✓ save
✓ load
✓ predict
- CLI commands
#2.3 APIs(2)
1. Skroutz Application Ecosystem (Ruby client)
- Megatron::Client
✓ Issues requests to microservice
- Megatron DB model
✓ Stores prediction results
- ApiController endpoint
✓ Receives callbacks from microservice
#3 Data
•Product attribute values (potential features), with an example product:
- Name: Samsung TV 32'' DF324 (PNDFD22) Full HD Black NEW
- Shop category: Home > Electronics > Televisions (Αρχική > Ηλεκτρονικά > Τηλεοράσεις)
- Part number: PNDFD22
- Price: 300 €
- Shop manufacturer
- EAN
- ...
#3 Data(2)
•Training Dataset - Raw Features: Text, Numerical, Categorical, Image, Label
#3 Data(3)
•Preprocessing
- Text
- Numerical
- Categorical
- Labels
•Output: X (features), y (labels)
#3 Preprocessing - Text
• Our best solution involves “Word Vectors”
• Steps to prepare for word vectors:
- Learn a words Vocabulary (mapping of words to numeric id)
- Transform text sentences to Sequences of ids based on Vocabulary
- Decide a representative sequence length (E.g. 60 words)
- Apply zero padding (pre or post) and truncation to maintain a fixed length
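The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the service's actual preprocessing: whitespace tokenization, the reserved ids and the function names are assumptions.

```python
from collections import Counter

PAD_ID = 0  # reserved for zero padding
OOV_ID = 1  # assumption: all words unseen during training share one id


def build_vocabulary(sentences, min_count=1):
    """Map each word to a numeric id, most frequent words first."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    words = [w for w, c in counts.most_common() if c >= min_count]
    return {w: i + 2 for i, w in enumerate(words)}  # ids 0 and 1 are reserved


def to_padded_sequence(sentence, vocab, max_len=60, padding="pre"):
    """Turn a sentence into a fixed-length list of ids.

    Sentences longer than max_len keep their first max_len words;
    shorter ones are zero-padded before ("pre") or after ("post").
    """
    ids = [vocab.get(w, OOV_ID) for w in sentence.lower().split()][:max_len]
    pad = [PAD_ID] * (max_len - len(ids))
    return pad + ids if padding == "pre" else ids + pad


vocab = build_vocabulary(["Samsung Galaxy S9", "Samsung TV 32"])
seq = to_padded_sequence("Samsung Galaxy Note", vocab, max_len=5)
```

With `max_len=5`, `seq` starts with two padding zeros, followed by the ids of “samsung” and “galaxy”, and ends with the out-of-vocabulary id for “note”.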
#3 Preprocessing - Text(2)
• Use of Pretrained Embeddings (see word2vec, fastText, GloVe etc.)
• We use the fastText library with the skipgram algorithm (unsupervised)
- https://fasttext.cc/docs/en/unsupervised-tutorial.html
#3 Preprocessing - Text(3)
• Embeddings:
- Each word maps to a 100-dimensional vector
- 1,500,000 rows in total (vocabulary size)
• 2 versions (Name, Shop-category)
#3 Preprocessing - Numerical
• “Pricevat” and “Name Length” values
• Apply Standard Scaling
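Standard scaling subtracts the mean and divides by the standard deviation, so each numerical feature ends up with zero mean and unit variance. A small stdlib-only sketch; the function names and sample prices are made up for illustration:

```python
from statistics import fmean, pstdev


def fit_standard_scaler(values):
    """Learn the mean and (population) standard deviation from training values."""
    mean = fmean(values)
    std = pstdev(values) or 1.0  # constant column: avoid division by zero
    return mean, std


def scale(values, mean, std):
    """Apply z = (x - mean) / std using previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical prices; in practice the statistics are fitted on the training
# set only and reused unchanged at inference time.
prices = [100.0, 200.0, 300.0]
mean, std = fit_standard_scaler(prices)
scaled = scale(prices, mean, std)
```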
#3 Preprocessing - Categorical
• All discrete value attributes/features:
- shop_id
- matching Product PNs category_id list
• One-Hot encoding:
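A minimal sketch of one-hot encoding for a single discrete value such as shop_id. The multi-hot variant for the list of category_ids is an assumption about how a list-valued feature could be encoded; the slides do not spell it out.

```python
def one_hot(value, known_values):
    """Encode one discrete value as a 0/1 vector over the known values.

    known_values: ordered list of distinct values seen in training.
    A value not in the list yields an all-zero vector.
    """
    return [1 if value == k else 0 for k in known_values]


def multi_hot(values, known_values):
    """Encode a list of discrete values as a single 0/1 vector (assumed scheme)."""
    present = set(values)
    return [1 if k in present else 0 for k in known_values]


shop_ids = [11, 42, 77]          # hypothetical shop ids
category_ids = [5, 7, 9]         # hypothetical category ids
shop_vec = one_hot(42, shop_ids)
pn_vec = multi_hot([5, 9], category_ids)
```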
#3 Label Encoding
• “category_id” values are the “true” labels the NN should learn
• One-Hot encoding
• OR just use the IDs and rely on Keras conventions (e.g. an internal sparse categorical
representation, which saves huge amounts of RAM)
#4 Training
1.Basic Concepts
2.Model Architecture
3.Training “In Action”
#4.1 Basic Concepts
•Objective:
- Find a combination of mathematical functions and a set of
corresponding params to maximize prediction accuracy (or minimize
error rate).
- Ensure that the above generalizes well for production.
- Learn params in an acceptable time window.
•Experiment with Neural Network architectures
•GPUs to the rescue (10x speedup)
#4.1 Basic Concepts(2)
•Loss function
- Categorical Crossentropy
•Optimizer
- Adam (a gradient descent variant)
•Hyper-params
- Mini-Batch Size
- Learning Rate
- Epochs
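For reference, categorical cross-entropy for one example is the negative log of the probability the model assigns to the true class. A small sketch with a numerically stable softmax; the logits are made up for illustration:

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(probs, true_index):
    """Loss for one example: -log(probability of the true class)."""
    return -math.log(probs[true_index])


probs = softmax([2.0, 1.0, 0.1])            # hypothetical logits for 3 classes
loss_right = categorical_crossentropy(probs, 0)  # true class is the top score
loss_wrong = categorical_crossentropy(probs, 2)  # true class got a low score
uniform_loss = categorical_crossentropy(softmax([0.0] * 4), 1)
```

The loss is small when the true class gets high probability and grows as that probability shrinks; with a uniform prediction over N classes it equals log(N).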
#4.1 Validation
•Why?
- Simulate unseen data
- Compare different:
✓ training methods
✓ hyper-params
- Avoid Overfitting
•Should be representative
•Validation Strategy
- 10% of whole Dataset
- Stratification on Categories
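A stratified split holds out the same fraction of every category, so the validation set keeps the category distribution of the whole dataset. A stdlib-only sketch; the function name, seed and toy labels are assumptions:

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.1, seed=0):
    """Split example indices into train/validation, category by category.

    labels: list with one category label per example.
    Returns (train_indices, val_indices), both sorted.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = ["tv"] * 20 + ["phone"] * 10   # toy dataset
train_idx, val_idx = stratified_split(labels, val_fraction=0.1)
```

Note that with this rounding, categories with very few examples may contribute nothing to the validation set.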
#4.2 Model Architecture
•Hybrid End-to-End architecture
•4 branches (4 input vectors):
A. Name Features Branch
B. Shop-Category Features Branch
C. Basic Features Branch (Numerics, Categorical)
D. Matching PNs Branch (Categorical)
#4.2 Text Branches
• Inspired by “Embed, Encode, Attend, Predict”
- https://explosion.ai/blog/deep-learning-formula-nlp
• Each of the “name” and “shop-category” sequences flows through:
- 1 x Embeddings Layer
- 1 x Bi-LSTM Encoder
- 1 x Attention Module
- 1 x LSTM Encoder
#4.2 Text Branches - Why LSTM?
• LSTM stands for “Long Short-Term Memory” (used here as an encoder layer):
- Memory cells capture context
- Propagates signal from previous words to the next ones in a sequence
- 2 stacked layers performed better in our experiments
- 128-dimensional output vector
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Figure labels: 128-dim output; #cells = sequence length
#4.2 Text Branches - Why pay Attention?
• Attention Mechanism:
- Controls how much signal is propagated to the next layers
- https://distill.pub/2016/augmented-rnns/
#4.2 Other Branches
• Basic Features Branch
- Inputs a concatenation of basic feats
- 1 Dense layer with #classes output
- ReLU activation
• Matching PNs Branch
- Inputs a concatenation of PN feats
- Short-circuited to final layer
• Figure labels: InputVector, #classes (~2k for Skroutz)
#4.2 Final Layer
• Merging Layer
- Concatenates the outputs of all 4 branches
- softmax activation
- Output: probabilities for each class
#4.2 Model Architecture
• Model Capacity/Complexity:
#4.3 Training In Action - Model Selection
•Conducted hundreds of experiments with different combinations
of features, layers and modules (e.g. Embeddings, Bag of Words,
TF-IDF, LSTM, etc.)
•Dozens of ablation studies: remove specific features to see how
performance is affected
•Read many papers and applied some common tricks (Bi-LSTM,
AdaptivePooling etc.)
•It is alchemy!
#4.3 Training In Action - Tools
•Training Scheduler Process runs weekly
•CLI training commands
- CUDA_VISIBLE_DEVICES=1 python -m category_classifier.cli scrooge --model end2end --train --epochs 8 --batch_size 128
•Model Versioning
- E.g. “skroutz_models_2018_09_01_v1.tar.gz”
#4.3 Training In Action
Training run output example:
GPU monitoring:
#4.3 Training In Action
Learning Curves (TensorBoard): current best vs. previous architecture
#5 Inference
1.Inference Pipeline
2.Inference API
3.Production
#5.1 Inference Pipeline
•Online execution:
- preprocessing
- vectorization
- Prediction
•Utilized by CategoryClassifier Class
- Wrapper of external API
•Utilize scikit-learn Pipelines
- http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
#5.2 Inference API
• REPL
• Kafka Worker
• Flask App
#5.3 Production
•2 inference VMs
- inference1.skroutz.gr, inference2.skroutz.gr (Kafka Workers)
•2 flavors (Greece, UK)
•Grafana Monitoring for Kafka Part
#6 Evaluation
•More than 6% overall error rate reduction in Skroutz!
•Currently, more than 2 content-editor hours saved per day in
Skroutz (and this scales)!
•Move operations from the list of “uncategorized” products
reduced significantly (by an order of magnitude)!
#6 Performance Summary
                   Success Rate         Failure Rate         No Prediction Rate
                   Megatron   Old       Megatron   Old       Megatron   Old
Skroutz (GR),      90.10%     82.6%     7.9%       13.8%     2%         3.5%
2.3k categories    91.85%     85.7%     8.14%      14.32%    N/A        N/A
Scrooge (UK),      87.56%     38.9%     2.5%       26.24%    9.9%       58.48%
350 categories     97.1%      93.67%    2.8%       6.32%     N/A        N/A
#6 Monitoring Dashboard
#Future Improvements
• Utilize Image Features (in End-To-End model)
• Utilize Entity Recognition to extract more features
• Find ways to utilize more features (color, sizes etc.)
• Categorical Self-Trained Embeddings
• Experiment with newer solutions like “Transformer”
#Contact Info
Andreas Loupasakis
• Email: alup@skroutz.gr
• Kaggle: https://www.kaggle.com/andreaslup
• Twitter: https://twitter.com/andy_lupo
• LinkedIn: https://www.linkedin.com/in/andreas-loupasakis-06399a47
Thank you!

Automated product categorization

  • 1.
  • 2. Chapters 1. Introduction 2. Service Architecture 3. Data 4. Training 5. Inference 6. Evaluation
  • 3. #1 What is Skroutz.gr? • Skroutz.gr is a marketplace & shopping assistant which makes online shopping easier and more reliable • It includes more than 11,000,000 products from 3,200 different e-shops • On a monthly basis the website welcomes more than 8 million unique visitors ranking in the top positions in the Greek Web
  • 4. #1 Some Numbers 3,200 merchants 11 million products 270 mil. pageviews / mo 1.1 mil. searches/day 33 mil. sessions/mo
  • 5. #1 The Problem • Each day we collect thousands of new products by downloading e-shop feeds (XML, CSV etc. - product catalogs) • We want to categorize incoming product payloads as provided by eshops to the most relevant categories in Skroutz category tree taxonomy with the minimum human intervention. - Difficult - Important
  • 6. #1 Why Difficult? • Many leaf categories in Skroutz taxonomy (>2k) • Sibling categories (subjective categorization) • Misleading product titles and shop-categories from shops
  • 7. #1 Why Important? Robot MO collects products from shop feeds and stores them to DB Megatron category classifier categorizes products to the correct category Tron groups similar products to entities called SKUs to be ready for indexing Elasticsearch indexes products to be searchable from user interface
  • 8. #1 Facts •Merchants send more than ~15k new products every day in Skroutz!!! •2.3k unique leaf categories in our category tree (taxonomy) •Manual “move-to-category” action: - Costs ~7.8s on average for content managers - Subjective decisions may add extra overhead
  • 9. #1 Old Solution - Overview •Use Elasticseach to match specific product attributes: - PN (manufacturer part number) - Name - Shop-category •Aggregate matches and group by categories •Normalize results and use custom weights to calculate a score •Take Top-K results
  • 10. #1 Old Solution - Limitations •Plain cosine similarity distance on TF/IDF weights: - No learning feedback loop - No advanced statistics utilization (e.g. correlation between price value and text features) •No easy way to tune custom weights applied on final scoring •Heuristics don’t take into account category specific context •Heuristics don’t take into account word level context. E.g. word “samsung” is followed by word “galaxy” most of the time and then probably follows a model number.
  • 11. #1 Old Solution - Good Parts •Simple solution (except for custom scoring stuff) •Easy to debug •Easy to deploy •Online
  • 12. #1 New Solution - “Megatron”
  • 13. #1 Overview •Approach problem as a supervised learning task •Rely on probabilities to obtain a meaningful score •Use more features from multiple sources and use datasets •Learn new patterns and relations by training •Measure performance on dataset splits •Use a microservice to serve classification requests •Apply threshold for low confidence results
  • 15. #2 Service Architecture 1.Training Phase 2.Inference Phase 3.APIs
  • 17. #2.1 Training Phase 1. Export the dataset (product features labeled with category_id) and upload it to Swift 2. Download a specific dataset version to the “training VM” 3. Start a training session using a train/val split of the dataset 4. Save a snapshot of the best-performing model params (based on validation set loss) 5. Compress and upload the model params to a Swift container
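Step 4 (keeping the snapshot with the lowest validation loss) can be illustrated with a framework-free loop. The `train_one_epoch` and `evaluate` callables and the toy parameter dictionary are hypothetical stand-ins. In Keras the same idea is usually expressed with the `ModelCheckpoint` callback using `monitor="val_loss"` and `save_best_only=True`, though the slides do not say which mechanism Megatron uses.

```python
import copy


def train_with_best_snapshot(train_one_epoch, evaluate, params, epochs):
    """Run `epochs` epochs and return the params with the lowest val loss."""
    best_loss = float("inf")
    best_params = None
    for _ in range(epochs):
        params = train_one_epoch(params)
        val_loss = evaluate(params)
        if val_loss < best_loss:
            # Copy, so later epochs cannot mutate the saved snapshot.
            best_loss = val_loss
            best_params = copy.deepcopy(params)
    return best_params, best_loss


# Toy stand-ins: the "model" is one weight whose ideal value is 3.0.
def toy_epoch(params):
    return {"w": params["w"] + 1.0}


def toy_val_loss(params):
    return (params["w"] - 3.0) ** 2


best_params, best_loss = train_with_best_snapshot(
    toy_epoch, toy_val_loss, {"w": 0.0}, epochs=5
)
```

In the toy run the loss falls until the third epoch and then rises again, so the snapshot from the third epoch is the one kept, not the final one.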
  • 18. #2.2 Inference Phase 1. Application Part: Send classification request upon new product arrivals: - Kafka producer (asynchronous request) - Megatron Client HTTP synchronous requests (2nd alternative) 2. Category Classifier Microservice Part: - Pop messages from stream (Kafka consumer) - Dispatch messages to in-memory Neural Network instance - Fetch predictions (scores) and post-back to Core Application API endpoint
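The microservice side of this flow (pop a message, run the in-memory model, post the scores back) might look roughly like the loop below. Everything here is an assumption made for illustration: a standard-library `queue.Queue` stands in for the Kafka topic, `predict` stands in for the neural network, `post_back` stands in for the HTTP callback to the core application, and the message fields are invented.

```python
import json
import queue


def run_worker(messages, predict, post_back):
    """Drain the queue: predict for every message and post the result back.

    A real Kafka consumer would block and run indefinitely; this sketch
    returns once the queue is empty so that it can be run as a script.
    """
    handled = 0
    while True:
        try:
            raw = messages.get_nowait()
        except queue.Empty:
            break
        product = json.loads(raw)
        scores = predict(product)
        post_back({"product_id": product["id"], "scores": scores})
        handled += 1
    return handled


# Hypothetical payloads and a fake model that always answers category 12.
messages = queue.Queue()
messages.put(json.dumps({"id": 1, "name": "Samsung TV 32'' Full HD"}))
messages.put(json.dumps({"id": 2, "name": "Samsung Galaxy S9"}))

received = []
handled = run_worker(messages, lambda product: {12: 0.9}, received.append)
```

Passing `predict` and `post_back` as arguments keeps the loop independent of the transport, which mirrors the slide's split between an asynchronous Kafka path and a synchronous HTTP alternative.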
  • 19. #2.3 APIs 1. Megatron microservice internal API - Common API (wraps Keras API) - Basic methods: ✓ build ✓ train ✓ save ✓ load ✓ predict - CLI commands
  • 20. #2.3 APIs(2) 2. Skroutz Application Ecosystem (Ruby client) - Megatron::Client ✓ Issues requests to microservice - Megatron DB model ✓ Stores prediction results - ApiController endpoint ✓ Receives callbacks from microservice
  • 22. #3 Data •Product attribute values (potential features): Product Name, Shop manufacturer, Part number, EAN, Price, Shop category, ... •Example: - Product Name: “Samsung TV 32'' DF324 (PNDFD22) Full HD Black NEW” - Shop category: “Αρχική > Ηλεκτρονικά > Τηλεοράσεις” (Home > Electronics > TVs) - Part number: PNDFD22 - Price: 300 €
  • 23. #3 Data(2) •Training Dataset - Raw Features: Image, Numerical, Categorical, Text, plus the Label
  • 24. #3 Data(3) •Preprocessing turns raw features into model inputs (X) and targets (y): - Text - Numerical - Categorical - Labels
  • 25. #3 Preprocessing - Text • Our best solution involves “Word Vectors” • Steps to prepare for word vectors: - Learn a word Vocabulary (a mapping of words to numeric ids) - Transform text sentences to Sequences of ids based on the Vocabulary - Decide on a representative sequence length (e.g. 60 words) - Apply zero padding (pre or post) and truncation to maintain a fixed length
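The four preparation steps can be sketched without any library. The whitespace tokenizer, the reserved ids (`PAD_ID`, `UNK_ID`) and the helper names are assumptions made for this illustration, and the production pipeline may tokenize differently.

```python
from collections import Counter

PAD_ID = 0  # reserved for zero padding
UNK_ID = 1  # reserved for out-of-vocabulary words (an assumption)


def build_vocabulary(texts, max_size=None):
    """Map each word to a numeric id, most frequent words first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {}
    for index, (word, _) in enumerate(counts.most_common(max_size), start=2):
        vocab[word] = index
    return vocab


def to_sequence(text, vocab, length, padding="post"):
    """Turn a text into a fixed-length list of ids (truncate, then pad)."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:length]
    pad = [PAD_ID] * (length - len(ids))
    return ids + pad if padding == "post" else pad + ids


names = ["Samsung Galaxy S9 Black", "Samsung TV 32 Full HD"]
vocab = build_vocabulary(names)
post_padded = to_sequence("Samsung Galaxy Note", vocab, length=5)
pre_padded = to_sequence("Samsung Galaxy Note", vocab, length=5, padding="pre")
truncated = to_sequence(names[1], vocab, length=3)
```

Here "note" is not in the vocabulary, so it maps to `UNK_ID`. Whether to pad before or after the tokens is a modelling choice that is typically settled by experiment.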
  • 26. #3 Preprocessing - Text(2) • Use of Pretrained Embeddings (see word2vec, FastText, GloVe etc.) • We use the FastText library with the skipgram algorithm (unsupervised) - https://fasttext.cc/docs/en/unsupervised-tutorial.html
  • 27. #3 Preprocessing - Text(3) • Embeddings: - Each word maps to a 100-dimensional vector - 1,500,000 rows in total (vocabulary size) • 2 versions (Name, Shop-category)
  • 28. #3 Preprocessing - Numerical • “Pricevat” and “Name Length” values • Apply Standard Scaling
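Standard scaling subtracts the mean and divides by the standard deviation learned from the training data. The small class below is a plain-Python sketch of that idea. The class name and the price values are hypothetical, and the population standard deviation is used, which is also what scikit-learn's `StandardScaler` does.

```python
from statistics import fmean, pstdev


class SimpleStandardScaler:
    """Scale values to zero mean and unit variance."""

    def fit(self, values):
        self.mean_ = fmean(values)
        # Fall back to 1.0 so that a constant feature does not divide by zero.
        self.scale_ = pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(value - self.mean_) / self.scale_ for value in values]


prices = [100.0, 200.0, 300.0]  # hypothetical "pricevat" values
scaler = SimpleStandardScaler().fit(prices)
scaled = scaler.transform(prices)
```

The statistics are fitted on the training split only and then reused unchanged at inference time, so that new products are scaled consistently with what the model saw.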
  • 29. #3 Preprocessing - Categorical • All discrete value attributes/features: - shop_id - list of category_ids of products with a matching PN • One-Hot encoding
  • 30. #3 Label Encoding • “category_id” values are the “true” labels which should be learned by the NN • One-Hot encoding • OR just use IDs and rely on Keras conventions (e.g. an internal sparse categorical representation, which saves huge amounts of RAM)
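The two label encodings can be compared side by side. The category ids below are made up, and `build_label_index` and `one_hot` are illustrative helpers rather than part of the Megatron code.

```python
def build_label_index(category_ids):
    """Map raw category ids to contiguous class indices 0..n-1."""
    return {cid: index for index, cid in enumerate(sorted(set(category_ids)))}


def one_hot(class_index, num_classes):
    """Dense encoding: a vector of zeros with a single 1.0."""
    vector = [0.0] * num_classes
    vector[class_index] = 1.0
    return vector


category_ids = [1105, 40, 1105, 12]  # hypothetical labels of four products
label_index = build_label_index(category_ids)
sparse_labels = [label_index[cid] for cid in category_ids]
dense_labels = [one_hot(index, len(label_index)) for index in sparse_labels]
```

With roughly 2.3k leaf categories, each one-hot row holds about 2,300 numbers, whereas the sparse form stores a single integer per product. Keras accepts integer labels directly when the loss is `sparse_categorical_crossentropy`.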
  • 32. #4 Training 1.Basic Concepts 2.Model Architecture 3.Training “In Action”
  • 33. #4.1 Basic Concepts •Objective: - Find a combination of mathematical functions and a set of corresponding params to maximize prediction accuracy (or minimize error rate). - Ensure that the above generalizes well for production. - Learn params in an acceptable time window. •Experiment with Neural Network architectures •GPUs to the rescue (10x speedup)
  • 34. #4.1 Basic Concepts(2) •Loss function - Categorical Crossentropy •Optimizer - Adam (Gradient Descent) •Hyper-params - Mini-Batch Size - Learning Rate - Epochs
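Categorical cross-entropy is the negative log of the probability the model assigns to the true class, averaged over a mini-batch. The sketch below shows the dense (one-hot) and sparse (class index) forms giving the same value. The function names mirror the Keras loss names, but these are plain-Python re-implementations written for illustration, and the probabilities are invented.

```python
import math


def categorical_crossentropy(y_true_one_hot, y_pred, eps=1e-7):
    """Loss for one sample with a one-hot target."""
    return -sum(
        target * math.log(max(prob, eps))
        for target, prob in zip(y_true_one_hot, y_pred)
    )


def sparse_categorical_crossentropy(class_index, y_pred, eps=1e-7):
    """Same loss, with the target given as a class index."""
    return -math.log(max(y_pred[class_index], eps))


def batch_loss(labels, predictions):
    """Mean loss over a mini-batch of (class index, probabilities) pairs."""
    losses = [
        sparse_categorical_crossentropy(label, probs)
        for label, probs in zip(labels, predictions)
    ]
    return sum(losses) / len(losses)


probs = [0.7, 0.2, 0.1]  # hypothetical softmax output for three classes
dense = categorical_crossentropy([1.0, 0.0, 0.0], probs)
sparse = sparse_categorical_crossentropy(0, probs)
mean_loss = batch_loss([0, 1], [probs, [0.1, 0.8, 0.1]])
```

The optimizer (Adam in the slides) then adjusts the parameters to reduce this mean loss, one mini-batch at a time.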
  • 35. #4.1 Validation •Why? - Simulate unseen data - Compare different: ✓ training methods ✓ hyper-params - Avoid Overfitting •Should be representative •Validation Strategy - 10% of whole Dataset - Stratification on Categories
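A stratified 10% hold-out keeps each category's share roughly equal in the training and validation sets. The helper below is a minimal sketch of that strategy. The function name, the seed and the rule of holding out at least one sample from every class that has more than one are assumptions, and scikit-learn's `train_test_split` with its `stratify` argument is the usual ready-made alternative.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.1, seed=42):
    """Return (train_indices, val_indices), sampled per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        if len(indices) > 1:
            # Assumption: hold out at least one sample of each such class.
            n_val = max(1, round(len(indices) * val_fraction))
        else:
            n_val = 0  # a single sample stays in the training set
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)


labels = [0] * 50 + [1] * 30 + [2] * 20  # hypothetical category labels
train_idx, val_idx = stratified_split(labels)
val_counts = {c: sum(1 for i in val_idx if labels[i] == c) for c in (0, 1, 2)}
```

Without stratification, a random 10% sample could miss rare leaf categories entirely, and the validation loss would then say nothing about them.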
  • 37. #4.2 Model Architecture •Hybrid End-to-End architecture •4 branches (4 input vectors): A. Name Features Branch B. Shop-Category Features Branch C. Basic Features Branch (Numerics, Categorical) D. Matching PNs Branch (Categorical)
  • 38. #4.2 Text Branches • Inspired by “Embed, Encode, Attend, Predict” - https://explosion.ai/blog/deep-learning-formula-nlp • Each of “name” and “shop-category” sequence flows through: - 1 x Embeddings Layer - 1 x Bi-LSTM Encoder - 1 x Attention Module - 1 x LSTM Encoder
  • 39. #4.2 Text Branches - Why LSTM? • LSTM stands for “Long Short Term Memory” Layer (Encoder): - Memory Cells / Captures context - Propagates signal from previous words to the next in a Sequence - 2 Stacked Layers performed better in our experiments - 128-dimensional output vector - Number of cells = sequence length - https://colah.github.io/posts/2015-08-Understanding-LSTMs/
  • 40. #4.2 Text Branches - Why pay Attention? • Attention Mechanism: - Controls how much signal is propagated to the next layers - https://distill.pub/2016/augmented-rnns/
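One common way to "control how much signal is propagated" is to score every encoder output against a context vector, turn the scores into weights with a softmax, and scale each output by its weight. The sketch below shows that generic dot-product form. The slides do not specify which attention module Megatron uses, so the `attend` function, the context vector and the toy states are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(hidden_states, context):
    """Scale each hidden state by its attention weight.

    The result is still a sequence, so it could feed a further encoder.
    """
    scores = [
        sum(h * c for h, c in zip(state, context)) for state in hidden_states
    ]
    weights = softmax(scores)
    reweighted = [
        [weight * value for value in state]
        for weight, state in zip(weights, hidden_states)
    ]
    return reweighted, weights


# Three toy 2-dimensional encoder outputs; `context` would be learned.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = [1.0, 1.0]
reweighted, weights = attend(states, context)
```

The third state aligns best with the context vector, so it receives the largest weight and passes through with the least attenuation.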
  • 41. #4.2 Other Branches • Basic Features Branch - Inputs a concatenation of basic feats - 1 Dense layer with #classes output - ReLU activation • Matching PNs Branch - Inputs a concatenation of PN feats - Short-circuited to final layer (figure labels: input vector; #classes, ~2k for Skroutz)
  • 42. #4.2 Final Layer • Merging Layer - Concatenates all 4 branches outputs - softmax activation - Output: probabilities for each class
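The softmax output can be combined with the low-confidence threshold mentioned in the overview: when the best probability is below the threshold, no prediction is returned. This is a sketch only. The logits, the category ids, the `classify` helper and the threshold value of 0.5 are all hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classify(logits, category_ids, threshold=0.5):
    """Return (category_id or None, probability of the best class)."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] < threshold:
        return None, probabilities[best]  # low confidence: no prediction
    return category_ids[best], probabilities[best]


category_ids = [12, 40, 77]  # hypothetical leaf categories
confident = classify([4.0, 1.0, 0.5], category_ids)
uncertain = classify([1.0, 1.0, 1.0], category_ids)
```

Raising the threshold trades a lower failure rate for a higher no-prediction rate, leaving more products for manual categorization.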
  • 44. #4.2 Model Architecture • Model Capacity/Complexity:
  • 45. #4.3 Training In Action - Model Selection •Conducted 100s of experiments with different combinations of features, layers, modules (e.g. Embeddings, Bag of Words, TF/IDF, LSTM, etc.) •10s of ablation studies: remove specific features to see how performance is affected •Read many papers and applied some common tricks (Bi-LSTM, AdaptivePooling etc.) •It is alchemy!
  • 46. #4.3 Training In Action - Tools •Training Scheduler Process runs weekly •CLI training commands - CUDA_VISIBLE_DEVICES=1 python -m category_classifier.cli scrooge --model end2end --train --epochs 8 --batch_size 128 •Model Versioning - E.g. “skroutz_models_2018_09_01_v1.tar.gz”
  • 47. #4.3 Training In Action (screenshots: training run output example, GPU monitoring)
  • 48. #4.3 Training In Action Learning Curves (TensorBoard): current best model vs. previous architecture
  • 51. #5.1 Inference Pipeline •Online execution: - preprocessing - vectorization - Prediction •Utilized by CategoryClassifier Class - Wrapper of external API •Utilize scikit-learn Pipelines - http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
  • 52. #5.2 Inference API • REPL • Kafka Worker • Flask App
  • 53. #5.3 Production •x2 inference VMs - inference1.skroutz.gr, inference2.skroutz.gr (Kafka Workers) •x2 Flavors (Greece, UK) •Grafana Monitoring for Kafka Part
  • 55. #6 Evaluation •More than 6% error rate reduction overall in Skroutz! •Currently, more than 2 content-editor hours saved per day in Skroutz (and this scales)! •Move operations from the list of “uncategorized” products reduced significantly (by an order of magnitude)!
  • 56. #6 Performance Summary
                                      Success Rate       Failure Rate       No Prediction Rate
                                      Megatron  Old      Megatron  Old      Megatron  Old
      Skroutz (GR), 2.3k categories   90.10%    82.6%    7.9%      13.8%    2%        3.5%
                                      91.85%    85.7%    8.14%     14.32%   N/A       N/A
      Scrooge (UK), 350 categories    87.56%    38.9%    2.5%      26.24%   9.9%      58.48%
                                      97.1%     93.67%   2.8%      6.32%    N/A       N/A
  • 58. #Future Improvements • Utilize Image Features (in End-To-End model) • Utilize Entity Recognition to extract more features • Find ways to utilize more features (color, sizes etc.) • Categorical Self-Trained Embeddings • Experiment with newer solutions like “Transformer”
  • 59. #Contact Info Andreas Loupasakis • Email: alup@skroutz.gr • Kaggle: https://www.kaggle.com/andreaslup • Twitter: https://twitter.com/andy_lupo • LinkedIn: https://www.linkedin.com/in/andreas-loupasakis-06399a47