Entity matching of web offers, from html to similarity score.


This poster presents a methodology for entity matching of product web offers. It was presented at the 8th EuroSciPy conference at the end of August 2015 and describes Pricing Assistant’s recent work on product matching. The goal was to create a tool capable of determining whether two web pages are selling the same product. Our approach combines techniques from image analysis, semantic analysis and machine learning. The technique outperformed results reported in the existing literature, in sectors such as skincare, cycling equipment and sporting goods.


Entity matching of ecommerce offers

Paul Puget

Objectives

•  Identify whether two webpages present offers of the same product.
•  Define a methodology to compare HTML pages of ecommerce offers.
•  Respect context constraints.

Figure: one example of two different webpages representing similar offers.

Methodology

I. Parsing

•  From HTML pages to product information (name, description, image, …).
•  Extensive use of the lxml library to query HTML via a language derived from XPath.

Figure: from HTML to JSON product fields.

Name: Crème avene 40mL
Image: discount.fr/prodim.jpg
Description: This cream will have an immediate effect on …
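The parsing step can be sketched as follows. The poster used lxml; to stay dependency-free, this sketch instead uses the standard library's xml.etree.ElementTree, whose limited XPath subset only works on well-formed markup. The page content, the class names and the RULES table are hypothetical and only illustrate the idea of per-site path rules.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product page (real pages are rarely this clean).
PAGE = """
<html>
  <body>
    <div class="product">
      <h1 class="name">Crème avene 40mL</h1>
      <img class="main" src="discount.fr/prodim.jpg" />
      <p class="description">This cream will have an immediate effect on …</p>
    </div>
  </body>
</html>
"""

# Hypothetical per-site rules: field -> (path expression, attribute or None).
RULES = {
    "name": (".//h1[@class='name']", None),
    "image": (".//img[@class='main']", "src"),
    "description": (".//p[@class='description']", None),
}


def parse_offer(page, rules):
    """Apply each path rule to the page and collect the product fields."""
    root = ET.fromstring(page.strip())
    offer = {}
    for field, (path, attribute) in rules.items():
        node = root.find(path)
        if node is None:
            offer[field] = None  # field not found on this page
        elif attribute is None:
            offer[field] = (node.text or "").strip()
        else:
            offer[field] = node.get(attribute)
    return offer


offer = parse_offer(PAGE, RULES)
print(json.dumps(offer, ensure_ascii=False, indent=2))
```

Keeping the rules in a table separate from the parsing function means a new site only needs a new set of path expressions, which is one plausible way to limit the manual work this step requires.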

II. Feature extraction

•  Extract and normalize explicit features from product data.
•  First clean and tokenize the text using text cleaning techniques.
•  Then extract data based on dynamically built dictionaries and context.

Figure: extraction and normalization process for a simple three-word string. After extraction, the input "Cream JPG 40mL" yields Manufacturer: JPG and Volume: 40mL.
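A minimal sketch of extraction followed by normalization, applied to the poster's three-word example string. The alias dictionary is hard-coded here as an assumption (the poster builds its dictionaries dynamically), and the unit table and regular expression are illustrative choices rather than the author's implementation.

```python
import re
import unicodedata

# Hypothetical alias dictionary mapping a cleaned token to a canonical name.
MANUFACTURER_ALIASES = {"jpg": "Jean-Paul Gaultier"}

# Conversion factors to litres for a few volume units (assumed set of units).
UNIT_TO_LITRES = {"ml": 0.001, "cl": 0.01, "l": 1.0}

# A volume token is a number immediately followed by a known unit, e.g. "40ml".
VOLUME_RE = re.compile(r"^(\d+(?:[.,]\d+)?)(ml|cl|l)$")


def tokenize(text):
    """Strip accents, lowercase, and split on whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower().split()


def extract_features(text):
    """Extract manufacturer and volume, normalized to a canonical form."""
    features = {"manufacturer": None, "volume_l": None}
    for token in tokenize(text):
        if token in MANUFACTURER_ALIASES:
            features["manufacturer"] = MANUFACTURER_ALIASES[token]
        match = VOLUME_RE.match(token)
        if match:
            value = float(match.group(1).replace(",", "."))
            # Round to avoid floating-point noise in the converted volume.
            features["volume_l"] = round(value * UNIT_TO_LITRES[match.group(2)], 6)
    return features


features = extract_features("Cream JPG 40mL")
print(features)
```

Normalizing to one unit and one canonical manufacturer name is what makes the strict equality tests of the next section meaningful: "40mL" and "0.04L" would otherwise compare as different.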

III. Feature matching

•  From the previously extracted features, we compute a series of matching scores.
•  Two types of matchers were mainly used.

Boolean matching is based on strict equality. It can be of one or more of these three subtypes:

•  Negative: a negative result means the offers are different (e.g. volume, SKU, manufacturer).
•  Positive: a positive result means the offers are the same (only SKU is in this case).
•  Neutral: neither a match nor a mismatch allows a conclusion.
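The three Boolean outcomes can be sketched as below. The feature lists follow the poster's examples; treating a missing value as neutral and checking the positive feature before the negative ones are assumptions made for this illustration.

```python
from enum import Enum


class Outcome(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"


# Features whose inequality proves the offers differ (from the poster's examples).
NEGATIVE_FEATURES = ("volume", "sku", "manufacturer")
# Features whose equality proves the offers are the same (only SKU).
POSITIVE_FEATURES = ("sku",)


def boolean_match(offer_a, offer_b):
    """Compare two offers (dicts of normalized features) by strict equality."""
    # Assumption: an equal SKU is conclusive, so it is checked first.
    for feature in POSITIVE_FEATURES:
        a, b = offer_a.get(feature), offer_b.get(feature)
        if a is not None and a == b:
            return Outcome.POSITIVE
    # A known, differing value on any negative feature rules the pair out.
    for feature in NEGATIVE_FEATURES:
        a, b = offer_a.get(feature), offer_b.get(feature)
        if a is not None and b is not None and a != b:
            return Outcome.NEGATIVE
    # Missing or equal values on the remaining features prove nothing.
    return Outcome.NEUTRAL


same_sku = boolean_match({"sku": "A1", "volume": 0.04}, {"sku": "A1", "volume": 0.04})
other_volume = boolean_match(
    {"manufacturer": "Jean-Paul Gaultier", "volume": 0.04},
    {"manufacturer": "Jean-Paul Gaultier", "volume": 0.1},
)
unknown = boolean_match(
    {"manufacturer": "Jean-Paul Gaultier", "volume": None},
    {"manufacturer": "Jean-Paul Gaultier", "volume": 0.04},
)
print(same_sku, other_volume, unknown)
```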

Continuous matching gives a score between 0 and 1 depending on the similarity of the features.

•  Price: absolute and relative difference.
•  Name: token differences + jaro_winkler difference (jellyfish package).
•  Images: color comparison (numpy + scipy).
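The price and name matchers can be sketched as follows. The poster relied on the jellyfish package for Jaro-Winkler; the pure-Python version below is a stand-in written for this illustration. The exact score formulas (relative price similarity, Jaccard overlap of tokens) are assumptions, since the poster does not give them.

```python
def price_scores(price_a, price_b):
    """Return (absolute difference, relative similarity in [0, 1])."""
    absolute = abs(price_a - price_b)
    highest = max(price_a, price_b)
    relative = 1.0 if highest == 0 else 1.0 - absolute / highest
    return absolute, relative


def token_similarity(name_a, name_b):
    """Jaccard overlap between the sets of lowercase tokens of two names."""
    tokens_a, tokens_b = set(name_a.lower().split()), set(name_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def jaro(s1, s2):
    """Jaro similarity: matching characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro score boosted by the length of the common prefix (at most 4 chars)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1.0 - score)


print(price_scores(10.0, 8.0))
print(token_similarity("creme avene 40ml", "avene creme hydratante 40ml"))
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Token overlap is insensitive to word order, while Jaro-Winkler tolerates small spelling differences, so the two name scores complement each other as inputs to the classifier.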

Figure (feature extraction, continued): after normalization, the extracted fields become Manufacturer: Jean-Paul Gaultier and Volume: 0.04L.

Conclusion and perspectives

•  Classification accuracy is superior to that reported in recent literature, which does not go beyond 80% accuracy.
•  The methodology is not specific to one sector, whereas most studies in the literature are tested on hi-tech products.
•  However, results depend on the first two parts (parsing and extraction), which may require manual work.
•  For further improvement, feature engineering seems to be the most promising direction.
•  Next steps would be to use more advanced semantic techniques, such as those implemented in NLTK, and shape comparison techniques with scikit-image.

IV.a Web offer matching: main scoring technique

•  The problem of matching web offers is modeled as a classification problem: pairs of web offers are classified as valid or invalid pairs.
•  A dataset of pairs is created using Boolean positive matching and completed by manual matching.
•  The model that proved most accurate is the decision tree classifier as implemented in scikit-learn.
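To illustrate the classification step, here is a tiny decision tree written with the standard library only. It is a stand-in for scikit-learn's DecisionTreeClassifier, which the poster actually used, and it omits almost everything a real implementation offers. The feature vectors (name, price and color similarity) and their labels are made-up values for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a sequence of 0/1 labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best


def build_tree(rows, labels, max_depth=3):
    """Grow the tree recursively; a leaf stores the fraction of valid pairs."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return sum(labels) / len(labels)
    feature, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[feature] > threshold]
    left_rows, left_labels = zip(*left)
    right_rows, right_labels = zip(*right)
    return (feature, threshold,
            build_tree(left_rows, left_labels, max_depth - 1),
            build_tree(right_rows, right_labels, max_depth - 1))


def predict_proba(tree, row):
    """Walk down the tree and return the leaf's probability of a valid pair."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] <= threshold else right
    return tree


# Hypothetical labeled pairs: [name similarity, price similarity, color similarity].
ROWS = [[0.95, 0.90, 0.80], [0.90, 0.70, 0.90], [0.85, 0.95, 0.60],
        [0.40, 0.90, 0.70], [0.30, 0.50, 0.20], [0.60, 0.20, 0.90]]
LABELS = [1, 1, 1, 0, 0, 0]  # 1 = valid pair, 0 = invalid pair

tree = build_tree(ROWS, LABELS)
print(predict_proba(tree, [0.92, 0.80, 0.70]))
print(predict_proba(tree, [0.35, 0.90, 0.90]))
```

Returning a probability at each leaf, rather than a hard label, is what allows pairs to be ranked or filtered by confidence afterwards.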

IV.b Web offer matching: optimisations

•  Negative matching, applied through pandas DataFrame operations, eliminates most negative pairs. This saves a lot of computation time.
•  When comparing two ecommerce catalogues, we can improve accuracy by using the hypothesis that each product is unique within a catalogue. In this case, we can use an assignment algorithm to choose the best pairs.
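Both optimisations can be sketched with the standard library. The poster used pandas DataFrame operations for the first; grouping by exact key values below is a simplified stand-in that assumes no missing values. For the second, the poster does not name its assignment algorithm, so the brute-force search here is only an illustration for very small inputs; SciPy's scipy.optimize.linear_sum_assignment would be a more realistic choice. All catalogue entries and scores are made up.

```python
from collections import defaultdict
from itertools import permutations


def candidate_pairs(catalogue_a, catalogue_b, keys=("manufacturer", "volume")):
    """Yield only the index pairs whose key features are all equal."""
    buckets = defaultdict(list)
    for j, offer in enumerate(catalogue_b):
        buckets[tuple(offer[k] for k in keys)].append(j)
    for i, offer in enumerate(catalogue_a):
        for j in buckets.get(tuple(offer[k] for k in keys), []):
            yield i, j


def best_assignment(scores):
    """One-to-one assignment maximising the total score (brute force).

    scores[i][j] is the match score of offer i against offer j; this sketch
    requires at most as many rows as columns and is only usable on tiny inputs.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    best_total, best_columns = float("-inf"), None
    for columns in permutations(range(n_cols), n_rows):
        total = sum(scores[i][j] for i, j in enumerate(columns))
        if total > best_total:
            best_total, best_columns = total, columns
    return list(enumerate(best_columns))


catalogue_a = [
    {"manufacturer": "Jean-Paul Gaultier", "volume": 0.04},
    {"manufacturer": "Avene", "volume": 0.04},
]
catalogue_b = [
    {"manufacturer": "Avene", "volume": 0.04},
    {"manufacturer": "Jean-Paul Gaultier", "volume": 0.04},
    {"manufacturer": "Jean-Paul Gaultier", "volume": 0.1},
]
pairs = list(candidate_pairs(catalogue_a, catalogue_b))
print(pairs)

# Picking each row's best column independently would assign column 0 twice.
assignment = best_assignment([[0.9, 0.8], [0.85, 0.1]])
print(assignment)
```

In the example, only two of the six possible pairs survive the filter, and the assignment gives up the single highest score (0.9) to obtain a better total across both offers.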

Figure: classification score as a function of the portion of pairs classified (selected using the classifier's predicted probability). The test was conducted on a dataset of 50,000 web offer pairs.
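One plausible reading of this figure is a trade-off between coverage and accuracy: only pairs the classifier is confident about are classified, and the rest are left undecided. The sketch below assumes that reading. The confidence rule and all probabilities and labels are hypothetical, not the poster's data.

```python
def score_vs_coverage(probabilities, labels, confidence):
    """Return (portion of pairs classified, accuracy on those pairs).

    A pair is classified only when the predicted probability of its most
    likely class is at least `confidence`; other pairs are left undecided.
    """
    decided = [(p >= 0.5, bool(y)) for p, y in zip(probabilities, labels)
               if max(p, 1 - p) >= confidence]
    coverage = len(decided) / len(labels)
    if not decided:
        return coverage, None  # nothing classified at this confidence level
    accuracy = sum(pred == truth for pred, truth in decided) / len(decided)
    return coverage, accuracy


# Hypothetical predicted probabilities of "valid pair" and true labels.
probabilities = [0.95, 0.90, 0.60, 0.55, 0.40, 0.10, 0.05, 0.45]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

everything = score_vs_coverage(probabilities, labels, 0.5)
confident_only = score_vs_coverage(probabilities, labels, 0.8)
print(everything, confident_only)
```

On this toy data, classifying every pair gives 75% accuracy, while restricting to confident predictions covers half of the pairs with no errors.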
