This poster presents a methodology for entity matching of product web offers.
It was presented at the 8th EuroSciPy conference at the end of August 2015 and covers Pricing Assistant's recent work on product matching. The goal was to create a tool capable of determining whether two web pages are selling the same product. Our approach combines techniques from the fields of image analysis, semantic analysis and machine learning. The technique gave strong results and outperformed the existing literature in fields such as skincare, cycling equipment and sporting goods.
Entity matching of web offers, from html to similarity score.
1. Entity matching of ecommerce offers
Paul Puget
Objectives
• Identify if two webpages present offers of the same product.
• Define a methodology to compare HTML pages of ecommerce offers.
• Respect context constraints.
This is one example of two different webpages representing similar offers
Methodology
I. Parsing
• From HTML pages to product information (name, description, image, …).
• Extensive use of the lxml library to query HTML via a language derived from XPath.
Name: Crème avene 40mL
Image: discount.fr/prodim.jpg
Description: This cream will have an immediate effect on …
From HTML to JSON product fields
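The parsing step can be sketched as follows. This is a minimal illustration, not the poster's actual parser: it uses plain XPath through lxml rather than the derived query language mentioned above, and the page markup, class names and the `SELECTORS` and `parse_offer` names are all hypothetical.

```python
import json

from lxml import html  # third-party: pip install lxml

# Hypothetical page: tag names and class attributes are invented for this sketch.
PAGE = """
<html><body>
  <h1 class="product-name">Crème avene 40mL</h1>
  <img class="product-image" src="discount.fr/prodim.jpg">
  <div class="description">This cream will have an immediate effect on …</div>
</body></html>
"""

# One XPath expression per product field (assumed selectors, one set per website).
SELECTORS = {
    "name": '//h1[@class="product-name"]/text()',
    "image": '//img[@class="product-image"]/@src',
    "description": '//div[@class="description"]/text()',
}


def parse_offer(page_source, selectors):
    """Return a dict of product fields extracted from an HTML page."""
    tree = html.fromstring(page_source)
    fields = {}
    for field, expression in selectors.items():
        matches = tree.xpath(expression)
        # Keep the first match, or None when the page lacks the field.
        fields[field] = matches[0].strip() if matches else None
    return fields


offer = parse_offer(PAGE, SELECTORS)
offer_json = json.dumps(offer, ensure_ascii=False, indent=2)
```

Keeping the selectors in a dictionary separates the site-specific part (the expressions) from the generic part (the extraction loop), which is one plausible way to limit the manual work per website.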
II. Feature extraction
• Extract and normalize explicit features from product data.
• First clean and tokenize text using text cleaning techniques.
• Then extract data based on dynamically built dictionaries and context.
Extraction and normalisation process of a simple three-word string
Cream JPG 40mL
Manufacturer: JPG
Volume: 40mL
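The extraction and normalisation of the string "Cream JPG 40mL" could look like the sketch below. It assumes a static alias dictionary and a small unit table, whereas the poster builds its dictionaries dynamically; `MANUFACTURER_ALIASES`, `UNIT_TO_LITRES` and `extract_features` are hypothetical names. The normalized values (Jean-Paul Gaultier, 0.04 L) are the ones shown on the poster.

```python
import re

# Hypothetical alias dictionary; the poster builds its dictionaries dynamically.
MANUFACTURER_ALIASES = {"jpg": "Jean-Paul Gaultier"}

# Conversion factors to litres for the units handled by this sketch.
UNIT_TO_LITRES = {"ml": 0.001, "cl": 0.01, "l": 1.0}

# A token such as "40ml" or "0,5l": a number directly followed by a unit.
VOLUME_PATTERN = re.compile(r"^(\d+(?:[.,]\d+)?)(ml|cl|l)$")


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def extract_features(name):
    """Extract and normalize the manufacturer and volume from a product name."""
    features = {"manufacturer": None, "volume_l": None}
    for token in tokenize(name):
        volume = VOLUME_PATTERN.match(token)
        if volume:
            quantity = float(volume.group(1).replace(",", "."))
            litres = quantity * UNIT_TO_LITRES[volume.group(2)]
            features["volume_l"] = round(litres, 6)
        elif token in MANUFACTURER_ALIASES:
            features["manufacturer"] = MANUFACTURER_ALIASES[token]
    return features


features = extract_features("Cream JPG 40mL")
```

Normalizing every volume to a single unit makes the later strict-equality comparison meaningful: "40mL" and "0.04L" end up as the same value.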
III. Feature matching
• From the features we previously extracted, we compute a series of matching scores.
• Two types of matchers were mainly used.
Conclusion and perspectives
Boolean matching is based on strict equality. It can be of one or more of these three subtypes:
• Negative: a negative result means the offers are different (e.g. volume, SKU, manufacturer)
• Positive: a positive result means the offers are the same (only SKU falls in this case)
• Neutral: neither a match nor a mismatch allows a conclusion
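The three boolean subtypes can be sketched as a small tri-state comparison. This is an illustration under assumptions: treating a missing feature as neutral is a choice made here, not something the poster states, and the `boolean_match`, `MATCHERS` and `match_offers` names and the sample SKU values are hypothetical.

```python
def boolean_match(value_a, value_b, negative=False, positive=False):
    """Compare two feature values with strict equality.

    Returns "positive" (same product), "negative" (different products)
    or "neutral" (nothing can be concluded).
    """
    if value_a is None or value_b is None:
        return "neutral"  # assumption: a missing feature never allows a conclusion
    if value_a == value_b:
        return "positive" if positive else "neutral"
    return "negative" if negative else "neutral"


# Subtypes per feature, following the examples given on the poster:
# volume, SKU and manufacturer are negative matchers, only SKU is also positive.
MATCHERS = {
    "volume_l": {"negative": True},
    "manufacturer": {"negative": True},
    "sku": {"negative": True, "positive": True},
}


def match_offers(offer_a, offer_b):
    """Apply every boolean matcher to a pair of offers."""
    return {
        feature: boolean_match(offer_a.get(feature), offer_b.get(feature), **subtypes)
        for feature, subtypes in MATCHERS.items()
    }


# Hypothetical offers for illustration.
offer_a = {"sku": "0001", "volume_l": 0.04, "manufacturer": "Jean-Paul Gaultier"}
offer_b = {"sku": "0001", "volume_l": 0.04}
offer_c = {"sku": "0002", "volume_l": 0.1, "manufacturer": "Jean-Paul Gaultier"}

same_product = match_offers(offer_a, offer_b)
different_product = match_offers(offer_a, offer_c)
```

Note that equal volumes or manufacturers only give a neutral result: two different products can share both, so equality on those features proves nothing.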
Continuous matching gives a score between 0 and 1 depending on the similarity of features.
• Price: absolute and relative difference
• Name: token differences + jaro_winkler difference (jellyfish package)
• Images: color comparison (numpy + scipy)
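The price and name matchers could be sketched as below, using only the standard library. Two choices here are assumptions rather than the poster's definitions: the relative price difference is normalized by the larger price, and the token comparison is a Jaccard similarity. The Jaro-Winkler score from the jellyfish package is not reproduced in this sketch, and the sample prices and names are hypothetical.

```python
def price_differences(price_a, price_b):
    """Absolute and relative difference between two strictly positive prices."""
    absolute = abs(price_a - price_b)
    # Assumption: normalize by the larger price so the result lies in [0, 1).
    relative = absolute / max(price_a, price_b)
    return absolute, relative


def token_similarity(name_a, name_b):
    """Jaccard similarity of the two token sets, between 0 and 1."""
    tokens_a = set(name_a.lower().split())
    tokens_b = set(name_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty names are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical values for illustration.
absolute, relative = price_differences(10.0, 8.0)
score = token_similarity("Cream JPG 40mL", "JPG cream 40ml tube")
```

A token-set score ignores word order, which helps when two shops list the same words in a different sequence; a character-level measure such as Jaro-Winkler complements it by tolerating small spelling differences.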
Figure labels: extraction, then normalization. Normalized output: Manufacturer: Jean-Paul Gaultier; Volume: 0.04L
• Classification accuracy is superior to recent literature, which does not go beyond 80% accuracy.
• The methodology is not specific to one sector, whereas most studies in the literature are tested on hi-tech products.
• However, results depend on the first two parts (parsing and extraction), which may require manual work.
• For further improvements, feature engineering seems to be the most promising direction.
• Next steps would be more advanced semantic techniques, such as the ones implemented in NLTK, and shape comparison techniques with scikit-image.
IV.a Web offer matching: main scoring technique
• The problem of matching web offers is modeled as a classification problem, classifying pairs of web offers as valid or invalid pairs.
• A dataset of pairs is created using boolean positive matching and completed by manual matching.
• The model which proved to be the most accurate is the decision tree classifier as implemented in scikit-learn.
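A minimal sketch of the pair classification, assuming each pair is described by two continuous matching scores. The training values are invented for illustration and are not the poster's data, and the feature choice and `max_depth` setting are assumptions.

```python
from sklearn.tree import DecisionTreeClassifier  # third-party: scikit-learn

# Hypothetical training pairs, invented for this sketch.
# Columns: relative price difference, name similarity.
X_train = [
    [0.02, 0.95], [0.05, 0.90], [0.01, 0.85], [0.10, 0.80],  # valid pairs
    [0.60, 0.20], [0.45, 0.30], [0.80, 0.10], [0.50, 0.25],  # invalid pairs
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = valid pair, 0 = invalid pair

classifier = DecisionTreeClassifier(max_depth=3, random_state=0)
classifier.fit(X_train, y_train)

# Two unseen candidate pairs: one close match, one clear mismatch.
candidates = [[0.03, 0.92], [0.70, 0.15]]
labels = classifier.predict(candidates)
# Probability of the "valid" class, usable to rank or threshold pairs.
probabilities = classifier.predict_proba(candidates)[:, 1]
```

The class probability is what makes it possible to classify only the most confident portion of the pairs and leave the rest for manual review.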
IV.b Web offer matching: optimisations
• Negative matching allows most negative pairs to be eliminated via pandas DataFrame operations. This saves a lot of computation time.
• When comparing two ecommerce catalogues, we can improve accuracy by assuming that each product is unique within a catalogue. In this case, we can use an assignment algorithm to choose the best pairs.
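Both optimisations can be sketched together. The catalogues, column names and pair scores below are hypothetical. The inner join is one possible way to discard negative pairs with pandas, and `scipy.optimize.linear_sum_assignment` is used as an example of an assignment algorithm; the poster does not say which one was used.

```python
import pandas as pd
from scipy.optimize import linear_sum_assignment

# Hypothetical catalogues; column names are assumptions of this sketch.
catalogue_a = pd.DataFrame({
    "offer_a": ["a1", "a2", "a3"],
    "manufacturer": ["BrandX", "BrandX", "BrandY"],
    "volume_ml": [40, 40, 200],
})
catalogue_b = pd.DataFrame({
    "offer_b": ["b1", "b2", "b3"],
    "manufacturer": ["BrandX", "BrandX", "BrandY"],
    "volume_ml": [40, 40, 500],
})

# An inner join on the negative-matching features keeps only the pairs that
# agree on manufacturer and volume: 4 candidates instead of 3 x 3 = 9.
candidates = catalogue_a.merge(catalogue_b, on=["manufacturer", "volume_ml"])

# Hypothetical classifier probabilities for the remaining pairs.
pair_scores = {
    ("a1", "b1"): 0.90, ("a1", "b2"): 0.80,
    ("a2", "b1"): 0.85, ("a2", "b2"): 0.30,
}
candidates["score"] = [
    pair_scores[pair] for pair in zip(candidates["offer_a"], candidates["offer_b"])
]

# One row per offer of catalogue A, one column per offer of catalogue B.
score_matrix = candidates.pivot(index="offer_a", columns="offer_b", values="score")
score_matrix = score_matrix.fillna(0.0)

# The solver minimizes a cost, so the scores are negated to maximize their sum.
rows, cols = linear_sum_assignment(-score_matrix.to_numpy())
best_pairs = [(score_matrix.index[r], score_matrix.columns[c]) for r, c in zip(rows, cols)]
```

In this example a greedy choice would take the highest score first (a1 with b1, 0.90) and be left with a2 and b2 at 0.30. The assignment instead pairs a1 with b2 and a2 with b1, for a higher total of 1.65.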
Classification score depending on the portion of classified pairs (defined using the classification probability). The test was conducted on a dataset of 50,000 web offer pairs.