Shop Vertical Classification
@
Arthur Prévot
Meetup Machine Learning – Toronto – March 1st 2016
Background
• Large ecommerce platform
• 240K+ current customers
• Many more shops created (churned or didn’t make it to customer status)
Problem
● In most cases we have no information about a shop's industry
1st solution
● Ask them
2nd solution
● We have HTML product descriptions for each shop
● We have labelled data (Mechanical Turk)
Classifier
Context
• Started during a Shopify Hack Day
• Pursued as a side project at work
• Used sk-learn and
• Moved to Spark MLlib for full-scale testing and production
• Now in production
Product Description
Getting Label Data
• Asked Amazon Mechanical Turk workers (turkers) to assess 80K stores
• Each turker had to choose among 15 verticals
• Involved hundreds of turkers
Input (80K shops)

Shop   Aggregated product data
1      “Nice octopolo shirt !…”
2      “Nice hat and nice shirt …”
3      “Set of <b> tires </b> …”
4      “Beef and more beef…”
5      “Tire set for bikes”
...    ...
Cleaning (80K shops)

Shop   Text
1      “nice octopolo shirt…”
2      “nice hat nice shirt…”
3      “set tire…”
4      “beef beef…”
5      “tire set bike”
...    ...

• HTML code removed
• Stop words removed
• Words stemmed
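The three cleaning steps above can be sketched in plain Python. This is only an illustration, not the code used in the talk: the stop-word list and the `toy_stem` helper are hypothetical stand-ins (a real pipeline would use a full stop-word list and a proper stemmer such as Porter's), chosen so that the slide's example descriptions come out as shown.

```python
import html
import re

# Hypothetical, deliberately tiny stop-word list (assumption, not from the talk).
STOP_WORDS = {"a", "an", "and", "for", "more", "of", "the"}

TAG_RE = re.compile(r"<[^>]+>")   # matches HTML tags such as <b> or </b>
TOKEN_RE = re.compile(r"[a-z]+")  # keeps runs of lowercase letters only


def toy_stem(word):
    """Stand-in for a real stemmer: only strips a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean(description):
    """Remove HTML, drop stop words, and stem the remaining words."""
    text = TAG_RE.sub(" ", description)      # 1. HTML code removed
    text = html.unescape(text).lower()
    tokens = TOKEN_RE.findall(text)
    kept = [t for t in tokens if t not in STOP_WORDS]  # 2. stop words removed
    return " ".join(toy_stem(t) for t in kept)          # 3. words stemmed


print(clean("Set of <b> tires </b>"))  # set tire
print(clean("Tire set for bikes"))     # tire set bike
```

Keeping each step as a separate line makes it easy to swap in a real stemmer or stop-word list later without touching the rest.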
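Term Frequency (joined with the Mechanical Turk labels)

Shops   nice   octopolo   shirt   hat   set   tires   beef   bike   ...   label
1       1      1          1       0     0     0       0      0      ...   Apparel
2       2      0          1       1     0     0       0      0      ...   Apparel
3       0      0          0       0     1     1       0      0      ...   Auto
4       0      0          0       0     0     0       2      0      ...   Food
5       0      0          0       0     1     1       0      1      ...   Auto
...     ...    ...        ...     ...   ...   ...     ...    ...    ...   ...

80K shops × 10K words (8 words shown in the example)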
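A minimal sketch of how such a labelled term-frequency table could be built, using only the standard library. The five shops, the eight-word vocabulary and the labels are the toy values from the slide; the function names are hypothetical, and the production job (Spark MLlib, per the talk) would look different.

```python
from collections import Counter

# Toy cleaned text per shop, written with the eight words of the example table.
cleaned_text = {
    1: "nice octopolo shirt",
    2: "nice hat nice shirt",
    3: "set tires",
    4: "beef beef",
    5: "tires set bike",
}

# Labels as they would come out of the Mechanical Turk labelling.
turk_labels = {1: "Apparel", 2: "Apparel", 3: "Auto", 4: "Food", 5: "Auto"}

# Fixed column order, matching the table.
VOCAB = ["nice", "octopolo", "shirt", "hat", "set", "tires", "beef", "bike"]


def term_frequencies(text, vocab=VOCAB):
    """Count how often each vocabulary word occurs in one shop's text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]


def build_training_rows(texts, labels):
    """Inner join: keep only shops that have both text and a label."""
    return {
        shop: (term_frequencies(text), labels[shop])
        for shop, text in texts.items()
        if shop in labels
    }


rows = build_training_rows(cleaned_text, turk_labels)
print(rows[2])  # ([2, 0, 1, 1, 0, 0, 0, 0], 'Apparel')
```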
Model
• Ran a few quick tests with scikit-learn and settled on Naïve Bayes
Naïve Bayes Model

From the term-frequency table (80K shops) to the model parameters (15 labels, 3 shown). There is one column per word (nice, octopolo, shirt, hat, set, tires, beef, bike); each cell holds P(word | label).

Shops   nice                octopolo                ...   bike                label     priors
1, 2    P(nice | apparel)   P(octopolo | apparel)   ...   P(bike | apparel)   Apparel   P(apparel)
3, 5    P(nice | auto)      P(octopolo | auto)      ...   P(bike | auto)      Auto      P(auto)
4       P(nice | food)      P(octopolo | food)      ...   P(bike | food)      Food      P(food)
What and why
• These are the model parameters
• Needed as input to the prediction formula

predicted class = argmax over cl of P(cl | doc)
What and why

predicted class = argmax over cl of P(cl | doc)

P(cl | doc) = P(cl) * P(doc | cl) / P(doc)
              (Bayes' theorem)
            ∝ P(cl) * P(doc | cl)
              (the denominator P(doc) is the same for every class, so it does not matter when comparing classes)
            = P(cl) * P(wd_1 | cl) * P(wd_2 | cl) * … * P(wd_n | cl)
              (conditional independence assumption, which is actually violated in practice)
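To make the formula concrete, here is a small sketch that scores a hypothetical new shop whose cleaned text is "tires bike". The priors and word probabilities are the smoothed toy values from the five-shop example in these slides (for instance P(tires | auto) = (2+1)/(5+8)); the new shop and the variable names are assumptions for illustration.

```python
from fractions import Fraction as F

# Toy parameters taken from the five-shop example in the slides.
priors = {"Apparel": F(2, 5), "Auto": F(2, 5), "Food": F(1, 5)}
likelihoods = {
    "Apparel": {"tires": F(1, 15), "bike": F(1, 15)},
    "Auto": {"tires": F(3, 13), "bike": F(2, 13)},
    "Food": {"tires": F(1, 10), "bike": F(1, 10)},
}


def unnormalised_posterior(words, cl):
    """P(cl) * P(wd_1 | cl) * ... * P(wd_n | cl), i.e. P(cl | doc) up to 1/P(doc)."""
    score = priors[cl]
    for word in words:
        score *= likelihoods[cl][word]
    return score


doc = ["tires", "bike"]  # hypothetical new shop
scores = {cl: unnormalised_posterior(doc, cl) for cl in priors}

# Dividing by P(doc) rescales every class by the same factor,
# so the argmax is the same with or without it.
evidence = sum(scores.values())
posteriors = {cl: s / evidence for cl, s in scores.items()}

predicted = max(scores, key=scores.get)
print(predicted)  # Auto
```

Exact fractions are used here only to keep the toy arithmetic easy to check by hand.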
Numerical Limitation
• Multiplying many values close to 0 leads to floating-point underflow

P(cl | doc) ∝ P(cl) * P(wd_1 | cl) * P(wd_2 | cl) * … * P(wd_n | cl)
Same parameter table in log space: every prior and every P(word | label) cell is replaced by its log.

Shops   each word column (nice … bike)   label     priors
1, 2    Log(P(..))                       Apparel   Log(P(..))
3, 5    Log(P(..))                       Auto      Log(P(..))
4       Log(P(..))                       Food      Log(P(..))
Numerical limitation
• Way around: take the log, which turns the multiplication into a summation
• No impact on comparisons across classes, because log is monotonically increasing

From before:
P(cl | doc) ∝ P(cl) * P(wd_1 | cl) * P(wd_2 | cl) * … * P(wd_n | cl)

so:
log P(cl | doc) = log P(cl) + log P(wd_1 | cl) + log P(wd_2 | cl) + … + log P(wd_n | cl) + constant
(the constant is −log P(doc), which is the same for every class)
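The underflow and the log workaround are easy to reproduce. The sketch below uses made-up numbers (1,000 words per description, word probabilities of 1e-4 and 2e-4, a prior of 0.4) purely to trigger the effect; they are assumptions, not figures from the talk.

```python
import math


def product_score(prior, word_probs):
    """Direct product P(cl) * P(wd_1 | cl) * ... ; underflows for long documents."""
    score = prior
    for p in word_probs:
        score *= p
    return score


def log_score(prior, word_probs):
    """Same quantity in log space: a sum of logs instead of a product."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)


# Hypothetical long description: 1,000 words per class.
class_a = [1e-4] * 1000
class_b = [2e-4] * 1000  # every word is twice as likely under class B

# Both products underflow to 0.0, so the two classes can no longer be compared.
print(product_score(0.4, class_a), product_score(0.4, class_b))  # 0.0 0.0

# The log scores stay finite and still rank class B above class A.
print(log_score(0.4, class_b) > log_score(0.4, class_a))  # True
```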
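Getting cell probabilities

P(cl) = N_cl / N
(N_cl: number of training shops labelled cl; N: total number of training shops)

P(wd_n | cl) = N_cl,wd_n / Σ over vocab of N_cl,w
(N_cl,w: number of occurrences of word w in the shops labelled cl)

Dealing with P(wd | cl) = 0, which makes P(cl | doc) = 0 regardless of the other words:

P(wd_n | cl) ≈ (N_cl,wd_n + 1) / Σ over vocab of (N_cl,w + 1)
             = (N_cl,wd_n + 1) / ((Σ over vocab of N_cl,w) + |vocab|)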
Term frequency (80K shops, 5 shown):

Shops   nice   octopolo   shirt   hat   set   tires   beef   bike   label
1       1      1          1       0     0     0       0      0      Apparel
2       2      0          1       1     0     0       0      0      Apparel
3       0      0          0       0     1     1       0      0      Auto
4       0      0          0       0     0     0       2      0      Food
5       0      0          0       0     1     1       0      1      Auto

Resulting parameters (15 labels, 3 shown). Each word cell is (word count in the class + 1) / (total word count in the class + vocabulary size of 8); each prior is (shops in the class) / (5 shops).

Shops   nice          octopolo      shirt         hat           set           tires         beef          bike          label     priors
1, 2    (3+1)/(7+8)   (1+1)/(7+8)   (2+1)/(7+8)   (1+1)/(7+8)   (0+1)/(7+8)   (0+1)/(7+8)   (0+1)/(7+8)   (0+1)/(7+8)   Apparel   2/5
3, 5    (0+1)/(5+8)   (0+1)/(5+8)   (0+1)/(5+8)   (0+1)/(5+8)   (2+1)/(5+8)   (2+1)/(5+8)   (0+1)/(5+8)   (1+1)/(5+8)   Auto      2/5
4       (0+1)/(2+8)   (0+1)/(2+8)   (0+1)/(2+8)   (0+1)/(2+8)   (0+1)/(2+8)   (0+1)/(2+8)   (2+1)/(2+8)   (0+1)/(2+8)   Food      1/5
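The worked example can be reproduced with a few lines of Python. This is a sketch of add-one (Laplace) smoothing on the five toy shops, not the Spark MLlib job from the talk; the `train` function and its data layout are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction

VOCAB = ["nice", "octopolo", "shirt", "hat", "set", "tires", "beef", "bike"]

# (term frequencies, label) for the five toy shops from the slides.
training_rows = [
    ({"nice": 1, "octopolo": 1, "shirt": 1}, "Apparel"),
    ({"nice": 2, "shirt": 1, "hat": 1}, "Apparel"),
    ({"set": 1, "tires": 1}, "Auto"),
    ({"beef": 2}, "Food"),
    ({"set": 1, "tires": 1, "bike": 1}, "Auto"),
]


def train(rows, vocab):
    """Return (priors, likelihoods) with add-one smoothing on the word counts."""
    shops_per_class = Counter(label for _, label in rows)
    word_counts = defaultdict(Counter)
    for counts, label in rows:
        word_counts[label].update(counts)  # adds the per-word counts

    priors = {cl: Fraction(n, len(rows)) for cl, n in shops_per_class.items()}
    likelihoods = {}
    for cl in shops_per_class:
        total = sum(word_counts[cl][w] for w in vocab)
        likelihoods[cl] = {
            w: Fraction(word_counts[cl][w] + 1, total + len(vocab)) for w in vocab
        }
    return priors, likelihoods


priors, likelihoods = train(training_rows, VOCAB)
print(likelihoods["Apparel"]["nice"])  # 4/15, i.e. (3+1)/(7+8)
print(likelihoods["Auto"]["tires"])    # 3/13, i.e. (2+1)/(5+8)
print(priors["Food"])                  # 1/5
```

Because 1 is added once per vocabulary word, each class's word probabilities still sum to 1 after smoothing.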
Code

class LabeledDataFilter():
    ...
class Featurizer():
    ...
class Trainer():
    ...
class Evaluator():
    ...
class Predictor():
    ...
class verticalPredictor():
    # uses Featurizer() and Predictor()
    ...

Training job (every 7 days):   product_data → model, accuracy
Prediction job (every day):    product_data + model → shop+industry
Change in Training Set
• Start of home card
• Allowed asking shops for their industry on a voluntary basis
• Quickly grew to 50K shops
• Advantage: keeps growing over time
• Issue: the training set is not a fully random sample
Shop Dimension (in the data warehouse, updated daily)
Shop Name
Shop URL
Shop Address
Shop City
…
Shop Predicted Industry
…
Results

Shops     top category   turker 1   turker 2   turker 3   algo top1   algo top2   algo top3
Chive     Apparel        Apparel    Apparel    Art        Apparel     Sport       Art
Lackers   Sports         Sports     Apparel    Sports     Sports      Apparel     Food
Tesla     Auto           Auto       Auto       Sports     Fashion     Auto        Electro
...       ...            ...        ...

60-80% (turkers)   ~65% (algo)
Results

Shops     top category   turker 1   turker 2   turker 3   algo top1   algo top2   algo top3
Chive     Apparel        Apparel    Apparel    Art        Apparel     Sport       Art
Lackers   Sports         Sports     Apparel    Sports     Sports      Apparel     Food
Tesla     Auto           Auto       Auto       Sports     unknown     Auto        Electro
...       ...            ...        ...

90%   ~75%
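The "algo top1 / top2 / top3" columns suggest ranking the classes by score and keeping the best three. A minimal sketch of that ranking and of a top-k accuracy check follows; the function names and the example scores are hypothetical, and the accuracy printed here is computed on the three example rows only, so it is not one of the percentages reported on the slide.

```python
import heapq


def top_k(scores, k=3):
    """Return the k classes with the highest score (e.g. log posterior)."""
    return heapq.nlargest(k, scores, key=scores.get)


def top_k_accuracy(ranked_predictions, truths, k):
    """Share of shops whose true category is among the first k predictions."""
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, truths) if truth in ranked[:k]
    )
    return hits / len(truths)


# Hypothetical log scores for one shop.
example_scores = {"Auto": -3.0, "Sports": -5.0, "Food": -9.0, "Art": -4.0}
print(top_k(example_scores))  # ['Auto', 'Art', 'Sports']

# The three example rows from the first results table with algo columns.
truths = ["Apparel", "Sports", "Auto"]
ranked = [
    ["Apparel", "Sport", "Art"],
    ["Sports", "Apparel", "Food"],
    ["Fashion", "Auto", "Electro"],
]
print(top_k_accuracy(ranked, truths, k=1))  # 2 of 3 rows correct
print(top_k_accuracy(ranked, truths, k=2))  # 1.0
```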
Business Use
Management or product teams:
• What are the biggest industries by shop count and by sales?
• How does that evolve over time?
Theme team:
• We want to develop new themes for a given vertical. Can we see the top stores in this vertical to understand trends?
Event team:
• We want to be part of an event in the music business. Can we get interesting shops in this field?
Could be improved
● More metrics: add multiclass precision/recall
○ Now available in MLlib
● Better performance: rerun for combinations of parameters
○ Also added recently to MLlib, but missing some components
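For reference, per-class precision and recall can be computed from pairs of true and predicted labels as sketched below. This is plain Python rather than the MLlib metrics mentioned above, and the labels in the example are made up for illustration.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {class: (precision, recall)} for a multiclass problem."""
    true_positive, predicted, actual = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        actual[truth] += 1
        predicted[pred] += 1
        if truth == pred:
            true_positive[truth] += 1

    result = {}
    for cl in sorted(set(actual) | set(predicted)):
        # Convention used here: 0.0 when a class was never predicted / never present.
        precision = true_positive[cl] / predicted[cl] if predicted[cl] else 0.0
        recall = true_positive[cl] / actual[cl] if actual[cl] else 0.0
        result[cl] = (precision, recall)
    return result


# Hypothetical labels for five shops.
y_true = ["Apparel", "Sports", "Auto", "Apparel", "Food"]
y_pred = ["Apparel", "Sports", "Apparel", "Apparel", "Food"]

metrics = per_class_precision_recall(y_true, y_pred)
print(metrics["Apparel"])  # precision 2/3, recall 1.0
print(metrics["Auto"])     # (0.0, 0.0): the one Auto shop was predicted as Apparel
```

Per-class numbers like these show which verticals get confused with each other, which a single overall accuracy figure hides.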
DEMO
THE END
