Entity matching of ecommerce offers
Paul Puget

Objectives
• Identify if two webpages present offers of the same product.
• Define a methodology to compare HTML pages of ecommerce offers.
• Respect context constraints.

[Figure: one example of two different webpages representing similar offers.]

Methodology
I. Parsing
• From HTML pages to product information (name, description, image, …).
• Extensive use of the lxml library to query HTML via a language derived from XPath.

[Figure: from HTML to JSON product fields. Example output: Name: Crème avene 40mL; Image: discount.fr/prodim.jpg; Description: This cream will have an immediate effect on …]
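The parsing step can be sketched as follows. This is only an illustration, not the poster's actual code: the page markup, the class names and the RULES table are hypothetical. To stay dependency-free it uses the standard library's xml.etree.ElementTree, which supports a limited XPath subset and requires well-formed markup. On real-world HTML, lxml's `lxml.html.fromstring` and the element `xpath` method would play the same role.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product page; real pages need a lenient HTML parser.
PAGE = """
<html>
  <body>
    <h1 class="product-name">Crème avene 40mL</h1>
    <img class="product-image" src="discount.fr/prodim.jpg" />
    <div class="description">This cream will have an immediate effect on …</div>
  </body>
</html>
"""

# Hypothetical per-site rules: one XPath-like query (and optional attribute) per field.
RULES = {
    "name": (".//h1[@class='product-name']", None),
    "image": (".//img[@class='product-image']", "src"),
    "description": (".//div[@class='description']", None),
}


def parse_offer(page: str, rules: dict) -> dict:
    """Return the product fields found in `page`; missing fields map to None."""
    root = ET.fromstring(page.strip())
    fields = {}
    for field, (path, attribute) in rules.items():
        node = root.find(path)
        if node is None:
            fields[field] = None
        elif attribute is not None:
            fields[field] = node.get(attribute)
        else:
            fields[field] = "".join(node.itertext()).strip()
    return fields


offer = parse_offer(PAGE, RULES)
print(json.dumps(offer, ensure_ascii=False, indent=2))
```

Keeping the queries in a per-site table means that supporting a new shop only requires new rules, not new parsing code. That is one plausible way to contain the manual work this step may require.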
  	
  
II. Feature extraction
• Extract and normalize explicit features from product data.
• First clean and tokenize the text using text-cleaning techniques.
• Then extract data based on dynamically built dictionaries and context.

[Figure: extraction and normalization process of a simple three-word string, "Cream JPG 40mL". After extraction: Manufacturer: JPG; Volume: 40mL.]
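A minimal sketch of extraction followed by normalization on the three-word example. The alias dictionary, the unit table and the function names are assumptions made for illustration. The poster builds its dictionaries dynamically and also uses context, which this sketch does not attempt.

```python
import re

# Hypothetical alias dictionary; the poster builds its dictionaries dynamically.
MANUFACTURER_ALIASES = {"jpg": "Jean-Paul Gaultier"}

# Conversion factors to litres for the volume units handled in this sketch.
UNIT_TO_LITRES = {"ml": 0.001, "cl": 0.01, "l": 1.0}

# A number immediately followed by a unit, e.g. "40ml" or "0,5l".
VOLUME_PATTERN = re.compile(r"^(\d+(?:[.,]\d+)?)(ml|cl|l)$")


def tokenize(text: str) -> list:
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def extract_features(text: str) -> dict:
    """Extract and normalize the manufacturer and volume from a product string."""
    features = {"manufacturer": None, "volume_l": None}
    for token in tokenize(text):
        if token in MANUFACTURER_ALIASES:
            # Normalization: replace the alias with the canonical name.
            features["manufacturer"] = MANUFACTURER_ALIASES[token]
            continue
        match = VOLUME_PATTERN.match(token)
        if match:
            # Normalization: express every volume in litres.
            value = float(match.group(1).replace(",", "."))
            features["volume_l"] = round(value * UNIT_TO_LITRES[match.group(2)], 6)
    return features


print(extract_features("Cream JPG 40mL"))
```

Normalizing to a canonical manufacturer name and a single unit is what makes strict equality tests meaningful later on: "40mL" and "0.04L" compare as equal once both are expressed in litres.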
  
III. Feature matching
• From the previously extracted features, we compute a series of matching scores.
• Two main types of matchers were used.
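A rough sketch of the two matcher types, under assumptions of our own. The Boolean matcher tests strict equality and returns None when a value is missing on either side. The continuous matchers return a score between 0 and 1. The function names, the sample offers and the exact formulas are hypothetical. The name score uses difflib.SequenceMatcher from the standard library as a stand-in for a Jaro-Winkler similarity.

```python
from difflib import SequenceMatcher
from typing import Optional


def boolean_match(a, b) -> Optional[bool]:
    """Strict equality; None when a value is missing on either side."""
    if a is None or b is None:
        return None
    return a == b


def price_score(p1: float, p2: float) -> float:
    """1 minus the relative price difference, clamped to [0, 1]."""
    highest = max(p1, p2)
    if highest <= 0:
        return 1.0 if p1 == p2 else 0.0
    return max(0.0, 1.0 - abs(p1 - p2) / highest)


def name_score(n1: str, n2: str) -> float:
    """Average of token overlap (Jaccard) and a character-level ratio."""
    t1, t2 = set(n1.lower().split()), set(n2.lower().split())
    union = t1 | t2
    jaccard = len(t1 & t2) / len(union) if union else 1.0
    # Stand-in for Jaro-Winkler: difflib's ratio, also within [0, 1].
    chars = SequenceMatcher(None, n1.lower(), n2.lower()).ratio()
    return (jaccard + chars) / 2


# Hypothetical offers, as they might look after extraction and normalization.
offer_a = {"sku": None, "volume_l": 0.04, "price": 12.0, "name": "Crème avene 40mL"}
offer_b = {"sku": None, "volume_l": 0.04, "price": 10.0, "name": "Avene crème 40 mL"}

scores = {
    "sku": boolean_match(offer_a["sku"], offer_b["sku"]),
    "volume": boolean_match(offer_a["volume_l"], offer_b["volume_l"]),
    "price": price_score(offer_a["price"], offer_b["price"]),
    "name": name_score(offer_a["name"], offer_b["name"]),
}
print(scores)
```

Returning None rather than False for a missing value keeps "unknown" distinct from "different", so a missing SKU does not count as evidence against a pair.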
  
	
  	
  	
  	
Boolean matching is based on strict equality. A Boolean matcher can be of one or more of these three subtypes:
• Negative: a negative result means the offers are different (e.g. volume, SKU, manufacturer).
• Positive: a positive result means the offers are the same (only SKU falls in this case).
• Neutral: neither a match nor a mismatch allows a conclusion.

Continuous matching gives a score between 0 and 1 depending on the similarity of the features.
• Price: absolute and relative difference.
• Name: token differences + Jaro-Winkler difference (jellyfish package).
• Images: color comparison (numpy + scipy).

[Figure, extraction and normalization example: after normalization, Manufacturer: Jean-Paul Gaultier; Volume: 0.04L.]

Conclusion and perspectives
• Classification accuracy is superior to that of the recent literature, which does not go beyond 80% accuracy.
• The methodology is not specific to one sector, whereas most studies in the literature are tested on hi-tech products.
• However, results depend on the first two parts (parsing and extraction), which may require manual work.
• For further improvements, feature engineering seems to be the most promising direction.
• Next steps would be to use more advanced semantic techniques, such as those implemented in NLTK, and shape comparison techniques with scikit-image.
IV.a Web offer matching: main scoring technique
• The problem of matching web offers is modeled as a classification problem: pairs of web offers are classified as valid or invalid.
• A dataset of pairs is created using Boolean positive matching and completed by manual matching.
• The model that proved most accurate is the decision tree classifier as implemented in scikit-learn.
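As an illustration of the classification setup, the sketch below trains scikit-learn's DecisionTreeClassifier on a handful of pairs. The feature columns, the values and the hyperparameters are all invented for the example. The poster does not give its feature vector or its settings.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature vectors, one per pair of offers:
# [price_score, name_score, volume_equal]
X_train = [
    [0.95, 0.90, 1],
    [0.90, 0.85, 1],
    [0.99, 0.70, 1],
    [0.40, 0.20, 0],
    [0.85, 0.30, 0],
    [0.30, 0.60, 0],
]
# Labels: 1 = valid pair (same product), 0 = invalid pair.
y_train = [1, 1, 1, 0, 0, 0]

# max_depth and random_state are arbitrary choices for this sketch.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Score a new candidate pair.
candidate = [[0.92, 0.88, 1]]
label = model.predict(candidate)[0]
# Column 1 of predict_proba is the probability of class 1 (valid pair).
probability_valid = model.predict_proba(candidate)[0][1]
print(label, probability_valid)
```

The class probability from `predict_proba` gives a confidence for each pair, which can be used to classify only the pairs the model is sure about.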
  
IV.b Web offer matching: optimizations
• Negative matchings make it possible, via pandas DataFrame operations, to eliminate most negative pairs. This saves a lot of computation time.
• When comparing two ecommerce catalogues, we can improve accuracy by using the hypothesis that products are unique. In this case, we can use an assignment algorithm to choose the best pairs.
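Both optimizations can be sketched on a toy example. The catalogues and probabilities are hypothetical, and two substitutions are made to keep the sketch dependency-free. The negative filter is written in plain Python rather than with pandas. The assignment step is an exhaustive search, usable only on tiny inputs; on real catalogues a dedicated solver such as `scipy.optimize.linear_sum_assignment` would be the natural choice. The sketch also assumes that each product appears at most once per catalogue.

```python
from itertools import permutations

# Hypothetical catalogues after feature extraction and normalization.
catalogue_a = [
    {"id": "a1", "manufacturer": "Jean-Paul Gaultier", "volume_l": 0.04},
    {"id": "a2", "manufacturer": "Avene", "volume_l": 0.04},
    {"id": "a3", "manufacturer": "Avene", "volume_l": 0.2},
]
catalogue_b = [
    {"id": "b1", "manufacturer": "Avene", "volume_l": 0.04},
    {"id": "b2", "manufacturer": "Avene", "volume_l": None},
    {"id": "b3", "manufacturer": "Jean-Paul Gaultier", "volume_l": 0.04},
]

NEGATIVE_FEATURES = ("manufacturer", "volume_l")


def is_negative(a: dict, b: dict) -> bool:
    """True when a negative feature is known on both sides and differs."""
    return any(
        a[f] is not None and b[f] is not None and a[f] != b[f]
        for f in NEGATIVE_FEATURES
    )


# Step 1: drop pairs ruled out by a negative match before any scoring.
candidates = [
    (a["id"], b["id"])
    for a in catalogue_a
    for b in catalogue_b
    if not is_negative(a, b)
]

# Step 2: hypothetical classifier probabilities for the surviving pairs.
probabilities = {
    ("a1", "b3"): 0.95,
    ("a2", "b1"): 0.80,
    ("a2", "b2"): 0.85,
    ("a3", "b2"): 0.75,
}


def best_assignment(ids_a, ids_b, probabilities):
    """Exhaustive one-to-one assignment maximizing the summed probability.

    Assumes len(ids_a) <= len(ids_b); only practical for tiny inputs.
    """
    best_total, best_pairs = -1.0, []
    for perm in permutations(ids_b, len(ids_a)):
        pairs = list(zip(ids_a, perm))
        total = sum(probabilities.get(pair, 0.0) for pair in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    # Keep only assigned pairs that survived the negative filter.
    return [pair for pair in best_pairs if pair in probabilities]


ids_a = [a["id"] for a in catalogue_a]
ids_b = [b["id"] for b in catalogue_b]
matches = best_assignment(ids_a, ids_b, probabilities)
print(candidates)
print(matches)
```

In this example, picking the best partner for each offer independently would pair both a2 and a3 with b2. The one-to-one constraint instead assigns a2 to b1 and a3 to b2, which is the sense in which the uniqueness hypothesis can improve accuracy.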
  
[Figure: classification score depending on the portion of classified pairs (defined using classification probability). The test was conducted on a dataset of 50,000 web offer pairs.]
