Indix at Fifth Elephant 2015

  • 1. Nikhil Ketkar, Data Science @ Indix, 16 July 2015
  • 5. Diagram labels: Crawler, Matching, Product Pages, Groups of Matching URLs, Focus of the Talk
  • 6. ○ Competitive Landscape
          ▪ Who are your competitors?
          ▪ How are they pricing products?
          ▪ What other products do they carry?
      ○ Scale
          ▪ Products
          ▪ Sites
          ▪ Categories
      Diagram labels: Store, Product, Match
      Matching is central to answering key questions in retail analytics
  • 7. Fields: 1. Title; 2. Image URL; 3. Price; 4. Description; 5. Tables
      Challenges: Scale, Depth, Diversity, Change
      DOM Tree
      Title or Not: Binary Classification
      Class Imbalance
  • 8. DOM Tree; HTML Features; Visual Features; Random Forest Model
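
Slides 7 and 8 frame title extraction as a per-node binary classification over the DOM tree, with class imbalance as a named difficulty. The following is a minimal sketch of the feature-extraction side only, using Python's standard `html.parser`. The feature names (`tag`, `depth`, `n_chars`), the sample page, and the toy labelling rule (an `h1` node counts as the title) are assumptions for illustration and are not taken from the talk. No classifier is trained here.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}


class NodeFeatureExtractor(HTMLParser):
    """Collect one feature row per text-bearing DOM node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open tags from the root down to the current node
        self.rows = []   # one dict of features per text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unbalanced HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.rows.append({
                "tag": self.stack[-1],
                "depth": len(self.stack),
                "n_chars": len(text),
                "text": text,
            })


# Hypothetical product page, made up for this example.
PAGE = """
<html><body>
  <div><a>Home</a><a>Shoes</a></div>
  <h1>Nike Air Max Running Shoe</h1>
  <span>$89.99</span>
  <p>Lightweight running shoe with a rubber sole.</p>
</body></html>
"""

extractor = NodeFeatureExtractor()
extractor.feed(PAGE)
rows = extractor.rows

# Toy labels for illustration only: treat the h1 node as the title.
labels = [1 if row["tag"] == "h1" else 0 for row in rows]
positives = sum(labels)
negatives_per_positive = (len(labels) - positives) / positives
```

Even on this tiny page only one node in five is a title, which is the imbalance the slide refers to. A real page has far more non-title nodes, so a trained model would likely need class weighting or resampling.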
  • 9. Fields: 1. Title; 2. Image URL; 3. Price; 4. Description; 5. Tables
      Category Taxonomy
      Challenges: Large Taxonomy, Lack of Training Data, Changes in Taxonomy
  • 10. Linear SVM; CNN; Ensemble; Breadcrumb Mapping; Background Knowledge
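
Slide 10 lists several category predictors next to an ensemble. One simple way such signals could be combined is a majority vote, sketched below. The slides do not say how the ensemble is actually built, so the voting rule, the `BREADCRUMB_MAP` table, and the taxonomy paths are all hypothetical. The SVM and CNN outputs are stand-in strings, not real model calls.

```python
from collections import Counter

# Hypothetical mapping from a store breadcrumb to a node in a shared taxonomy.
BREADCRUMB_MAP = {
    ("home", "shoes", "running"): "Apparel > Footwear > Running Shoes",
    ("home", "electronics", "phones"): "Electronics > Mobile Phones",
}


def map_breadcrumb(breadcrumb):
    """Look up a 'A > B > C' breadcrumb; return None when it is unknown."""
    key = tuple(part.strip().lower() for part in breadcrumb.split(">"))
    return BREADCRUMB_MAP.get(key)


def ensemble_vote(predictions):
    """Majority vote over non-None predictions.

    Ties go to the label seen first, because Counter.most_common keeps
    first-encountered order for equal counts.
    """
    votes = Counter(p for p in predictions if p is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


predictions = [
    "Apparel > Footwear > Running Shoes",   # stand-in for a linear SVM output
    "Apparel > Footwear > Sneakers",        # stand-in for a CNN output
    map_breadcrumb("Home > Shoes > Running"),
]
category = ensemble_vote(predictions)
```

A breadcrumb that is missing from the table simply abstains, so the vote falls back on the remaining predictors.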
  • 11. Fields: 1. Title; 2. Image URL; 3. Price; 4. Description; 5. Tables
      Challenges: Large number of attributes, bad/missing data, variability
      Attributes: 1. Brand; 2. Size; 3. Color; 4. Packs; 5. …
      Schema
  • 12. Brand:Nike; Brand:Reebok
  • 13. Brand:Nike; Color:Black/Neo Lime Total-Crimson; Sole:Rubber
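
The `Key:Value` strings on this slide suggest a normalised attribute record. Below is a minimal sketch of turning such strings into a dictionary while skipping bad or missing values. The function name, the lower-casing of keys, and the extra `Size:` entry with a missing value are assumptions added for illustration. The slides note that no single extraction approach works well, so this shows only the simplest case.

```python
def parse_attributes(pairs):
    """Turn 'Key:Value' strings into a dict with lower-cased keys."""
    record = {}
    for pair in pairs:
        key, sep, value = pair.partition(":")
        if not sep or not value.strip():
            continue  # skip bad or missing data rather than guess a value
        record[key.strip().lower()] = value.strip()
    return record


raw = [
    "Brand:Nike",
    "Color:Black/Neo Lime Total-Crimson",
    "Sole:Rubber",
    "Size:",  # hypothetical row with a missing value
]
attributes = parse_attributes(raw)
```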
  • 14. Fields: 1. Title; 2. Image URL; 3. Price; 4. Description; 5. Tables
      Challenges: No single approach works well
      Attributes: 1. Brand; 2. Size; 3. Color; 4. Packs; 5. …
      Category
      Enriched Product Record
  • 15. Merge Groups; Bucketing/Clustering; Mass Join; LSH
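
LSH appears on this slide as one way to bucket products so that not every pair has to be compared. The sketch below uses MinHash signatures with banding over title tokens. That is one common LSH scheme and is an assumption here, since the slides do not say which variant was used. The hash construction (seeded MD5), the number of hashes, and the band size are illustrative choices. Titles are assumed to be non-empty.

```python
import hashlib
from collections import defaultdict


def tokens(title):
    """Lower-cased whitespace tokens of a title, as a set."""
    return set(title.lower().split())


def minhash(token_set, num_hashes=8):
    """MinHash signature: the minimum of each seeded hash over the tokens."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in token_set
        ))
    return tuple(signature)


def lsh_buckets(titles, num_hashes=8, rows_per_band=2):
    """Put each title index in one bucket per band of its signature."""
    buckets = defaultdict(set)
    for idx, title in enumerate(titles):
        signature = minhash(tokens(title), num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            band = signature[start:start + rows_per_band]
            buckets[(start, band)].add(idx)
    return buckets


def candidate_pairs(buckets):
    """All index pairs (a, b) with a < b that share at least one bucket."""
    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


titles = [
    "NIKE Air Max 90",
    "Air Max 90 Nike",
    "Samsung Galaxy S6 Phone Case",
]
pairs = candidate_pairs(lsh_buckets(titles))
```

Titles with identical token sets always share every band, so the first two titles are guaranteed to become a candidate pair. Dissimilar titles collide only with low probability, and only candidate pairs go on to the more expensive distance computation.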
  • 16. Challenges: Pairwise Distance Computation, Match at a Store Constraint
      Diagram labels: Store, Product, Match
      1. Pairwise Distance Computation
      2. Constrained Clustering
  • 17. D(P, P′): 1. Title BOW; 2. Brand; 3. Category; 4. Attributes
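
The slide names four components of the pairwise distance between two products but not how they are combined. The sketch below assumes a weighted sum. The weights, the Jaccard distance on title tokens, the exact-match tests for brand and category, and the product dictionary layout are all hypothetical choices made for illustration.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical weights; the slides do not say how the components are combined.
WEIGHTS = {"title": 0.4, "brand": 0.3, "category": 0.2, "attributes": 0.1}


def distance(p, q):
    """Weighted distance in [0, 1] between two product records."""
    title_d = jaccard_distance(p["title"].lower().split(), q["title"].lower().split())
    brand_d = 0.0 if p["brand"].lower() == q["brand"].lower() else 1.0
    category_d = 0.0 if p["category"] == q["category"] else 1.0

    shared = set(p["attributes"]) & set(q["attributes"])
    if shared:
        mismatches = sum(p["attributes"][k] != q["attributes"][k] for k in shared)
        attr_d = mismatches / len(shared)
    else:
        attr_d = 0.5  # no overlapping attributes: treat as uninformative

    return (WEIGHTS["title"] * title_d
            + WEIGHTS["brand"] * brand_d
            + WEIGHTS["category"] * category_d
            + WEIGHTS["attributes"] * attr_d)


# Hypothetical product records.
p = {"title": "Nike Air Max 90 Black", "brand": "Nike",
     "category": "Running Shoes", "attributes": {"color": "black", "size": "10"}}
q = {"title": "Nike Air Max 90 Running Shoe Black", "brand": "nike",
     "category": "Running Shoes", "attributes": {"color": "black", "size": "11"}}

d_pq = distance(p, q)
```

With these records only the title and attribute terms contribute, because brand and category agree.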
  • 18. ○ Constraint Type
          ▪ Must Link
          ▪ Cannot Link
      ○ Examples
          ▪ UPC
          ▪ MPN
          ▪ Match at a Store
      Diagram labels: Must Link, Cannot Link, May Link, D(P, P′)
      Use Constrained Clustering
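
The slide's must-link, cannot-link, and may-link pairs can be illustrated with a small greedy procedure: merge must-link pairs first, then merge the remaining pairs closest-first whenever the distance is under a threshold and no cannot-link pair would end up in the same cluster. This is a sketch of one constrained-clustering strategy, not necessarily the one used in the talk. It assumes that a shared UPC implies must-link and that two products from the same store are cannot-link, which is one reading of "Match at a Store". The items, threshold, and title-only distance are hypothetical.

```python
from itertools import combinations


def title_distance(p, q):
    """Jaccard distance on title tokens (stand-in for a fuller D(P, P'))."""
    a, b = set(p["title"].split()), set(q["title"].split())
    return 1.0 - len(a & b) / len(a | b)


def constrained_clusters(items, distance, threshold, must_link=(), cannot_link=()):
    """Greedy agglomeration over item indices that honours pairwise constraints."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    forbidden = {frozenset(pair) for pair in cannot_link}

    def violates(root_a, root_b):
        members_a = [i for i in range(len(items)) if find(i) == root_a]
        members_b = [i for i in range(len(items)) if find(i) == root_b]
        return any(frozenset((a, b)) in forbidden
                   for a in members_a for b in members_b)

    # 1. Must-link pairs are merged unconditionally.
    for a, b in must_link:
        parent[find(a)] = find(b)

    # 2. May-link pairs are merged closest-first if no cannot-link is crossed.
    pairs = sorted(combinations(range(len(items)), 2),
                   key=lambda ab: distance(items[ab[0]], items[ab[1]]))
    for a, b in pairs:
        if distance(items[a], items[b]) > threshold:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b and not violates(root_a, root_b):
            parent[root_a] = root_b

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical product records.
items = [
    {"store": "A", "upc": "111", "title": "nike air max 90"},
    {"store": "B", "upc": "111", "title": "air max 90 by nike mens"},
    {"store": "A", "upc": None, "title": "nike air max 90 black"},
    {"store": "C", "upc": None, "title": "nike air max 90"},
]
index_pairs = list(combinations(range(len(items)), 2))
must = [(a, b) for a, b in index_pairs
        if items[a]["upc"] and items[a]["upc"] == items[b]["upc"]]
cannot = [(a, b) for a, b in index_pairs
          if items[a]["store"] == items[b]["store"]]

groups = constrained_clusters(items, title_distance, threshold=0.25,
                              must_link=must, cannot_link=cannot)
```

Here items 0 and 1 are joined by their shared UPC even though their titles differ, item 3 joins them on title similarity, and item 2 stays separate because it comes from the same store as item 0.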
  • 19. Stages: Parsing; Classification; Attribute Extraction; Blocking; Match Inference
      Data: HTML; Product Record; Classified Products; Attributes; Product Groups; Matches
  • 20. Diagram labels: Reported, Actual, Correct
      ○ Precision
          ▪ Sample and Spot-check
      ○ Recall
          ▪ Hard to estimate
          ▪ Rare population
          ▪ Manually search products on a site to produce blind sets
      Lack of Ground Truth is the biggest roadblock
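
The evaluation approach on this slide can be sketched numerically: precision from a spot-checked sample of reported matches, and recall from a manually built blind set of true matches. The Wilson score interval is added here as one reasonable way to express sampling uncertainty and is not mentioned in the slides. The sample sizes, judgements, and pair identifiers are made up for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n == 0:
        raise ValueError("need at least one sampled item")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def precision_from_spot_check(judgements):
    """judgements: one bool per sampled reported match (True = correct)."""
    correct = sum(judgements)
    return correct / len(judgements), wilson_interval(correct, len(judgements))


def recall_from_blind_set(blind_pairs, reported_pairs):
    """Share of manually found true matches that the system also reported."""
    found = sum(1 for pair in blind_pairs if pair in reported_pairs)
    return found / len(blind_pairs)


# Hypothetical spot check: 50 sampled matches, 46 judged correct.
judgements = [True] * 46 + [False] * 4
precision, (low, high) = precision_from_spot_check(judgements)

# Hypothetical blind set of 5 true matches, 3 of which were reported.
blind = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4"), ("a5", "b5")}
reported = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a9", "b9")}
recall = recall_from_blind_set(blind, reported)
```

Because true matches are rare among all product pairs, the blind set is usually small, which is why the recall estimate is much noisier than the precision estimate.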