Nikhil Ketkar
Data Science @ Indix
16 July 2015
  	
  
 	
  
[Diagram: Crawler -> Product Pages -> Matching -> Groups of Matching URLs; Matching is the focus of the talk]
  
• Competitive Landscape
  - Who are your competitors?
  - How are they pricing products?
  - What other products do they carry?
• Scale
  - Products
  - Sites
  - Categories

[Diagram: Store; Product; Match]

Matching is central to answering key questions in retail analytics.
  
1. Title
2. Image URL
3. Price
4. Description
5. Tables

Challenges: Scale, Depth, Diversity, Change

DOM Tree
Title or Not: Binary Classification
Class Imbalance
  
[Diagram: DOM Tree; HTML Features; Visual Features; Random Forest Model]
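As a rough illustration of the "HTML Features" step, the sketch below walks a page with Python's standard `html.parser` and emits one feature row per text node, which could then be labeled "title" or "not title" and fed to a random forest. The feature names, the `NodeFeatureExtractor` class, and the `balanced_class_weights` helper are assumptions made for this example, not the features used at Indix. Visual features (rendered position, font size) need a rendering engine and are not shown.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NodeFeatureExtractor(HTMLParser):
    """Collects one feature dict per non-empty text node in the DOM."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open (tag, attrs) pairs from the root to the current node
        self.rows = []   # hypothetical feature rows, one per text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        tag, attrs = self.stack[-1]
        self.rows.append({
            "text": text,
            "tag": tag,
            "depth": len(self.stack),
            "text_len": len(text),
            "is_heading": tag in {"h1", "h2", "h3"},
            "class_mentions_title": "title" in (attrs.get("class") or "").lower(),
        })


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}
```

Because a page has a single title node among many other nodes, the positive class is rare. Weighting each class by the inverse of its frequency, as `balanced_class_weights` does, is one common way to offset that imbalance during training.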
  
1. Title
2. Image URL
3. Price
4. Description
5. Tables

Category
Taxonomy

Challenges: Large Taxonomy, Lack of Training Data, Changes in Taxonomy
  
[Diagram: Linear SVM; CNN; Ensemble; Breadcrumb Mapping; Background Knowledge]
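The slides do not say how the individual signals are combined, so the following is only a minimal sketch of one possibility: a weighted vote over category predictions, where a breadcrumb lookup table acts as one of the voters. The mapping table, the source names ("svm", "cnn", "breadcrumb"), the category strings, and the weights are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from a site's breadcrumb trail to a taxonomy node.
BREADCRUMB_MAP = {
    ("shoes", "running"): "Apparel > Footwear > Running Shoes",
    ("electronics", "phones"): "Electronics > Mobile Phones",
}


def map_breadcrumb(breadcrumb):
    """Return the taxonomy category for a breadcrumb trail, or None if unmapped."""
    key = tuple(part.strip().lower() for part in breadcrumb)
    return BREADCRUMB_MAP.get(key)


def ensemble_vote(predictions, weights):
    """Weighted vote over {source: category}; sources that abstain (None) are skipped."""
    scores = defaultdict(float)
    for source, category in predictions.items():
        if category is not None:
            scores[category] += weights.get(source, 1.0)
    if not scores:
        return None
    # Highest score wins; ties are broken alphabetically so the result is deterministic.
    return min(scores, key=lambda category: (-scores[category], category))


predictions = {
    "svm": "Apparel > Footwear > Running Shoes",
    "cnn": "Apparel > Footwear > Sandals",
    "breadcrumb": map_breadcrumb(["Shoes", " Running "]),
}
best = ensemble_vote(predictions, {"svm": 1.0, "cnn": 1.0, "breadcrumb": 2.0})
```

Letting a voter abstain matters here because a breadcrumb mapping only covers the sites and trails that have been mapped; for everything else the learned classifiers decide alone.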
  
1. Title
2. Image URL
3. Price
4. Description
5. Tables

Challenges: Large number of attributes, bad/missing data, variability

1. Brand
2. Size
3. Color
4. Packs
5. …

Schema
  
Brand:Nike

Brand:Reebok

Brand:Nike
Color:Black/Neo Lime Total-Crimson
Sole:Rubber
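The key:value pairs above are the output of attribute extraction; the deck does not describe the method. As one simple illustration only, the sketch below tags a title using small lookup lists plus a regular expression for pack counts. The word lists, the pattern, and the `extract_attributes` function are assumptions for this example, and a real system would need far richer resources.

```python
import re

# Hypothetical lookup lists; real gazetteers would be much larger.
KNOWN_BRANDS = {"nike", "reebok"}
KNOWN_COLORS = {"black", "white", "red"}
PACK_RE = re.compile(r"\bpack of (\d+)\b", re.IGNORECASE)


def extract_attributes(title):
    """Return a dict of attributes found in a product title."""
    attrs = {}
    for token in re.findall(r"[A-Za-z0-9]+", title.lower()):
        # Keep only the first brand and the first color that appear.
        if token in KNOWN_BRANDS and "Brand" not in attrs:
            attrs["Brand"] = token.capitalize()
        elif token in KNOWN_COLORS and "Color" not in attrs:
            attrs["Color"] = token.capitalize()
    match = PACK_RE.search(title)
    if match:
        attrs["Packs"] = int(match.group(1))
    return attrs


record = extract_attributes("Nike Air Zoom Running Shoe, Black, Pack of 2")
```

A lookup like this misses multi-word values such as "Neo Lime" and unseen brands, which is the kind of gap that pushes toward combining several extraction approaches.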
  
1. Title
2. Image URL
3. Price
4. Description
5. Tables

Challenges: No single approach works well

1. Brand
2. Size
3. Color
4. Packs
5. …

Category

Enriched Product Record
  
[Diagram: Merge Groups; Bucketing/Clustering; Mass Join; LSH]
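To show what LSH-based bucketing can look like, here is a minimal MinHash sketch with banding: products whose title tokens overlap heavily are likely to share at least one bucket, so only pairs within a bucket need a full comparison. The tokenization, the number of hash functions, and the band count are assumptions for this example; the deck does not specify how LSH was configured.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed, token):
    """Stable 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value per seeded hash function."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def lsh_candidate_pairs(products, num_hashes=32, bands=8):
    """Return pairs of product ids that share at least one LSH band bucket."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for pid, title in products.items():
        tokens = set(title.lower().split())
        if not tokens:
            continue
        signature = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            band = signature[b * rows:(b + 1) * rows]
            buckets[(b, band)].add(pid)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


products = {
    "a": "Nike Air Zoom Pegasus 32",
    "b": "nike air zoom pegasus 32",
    "c": "Samsung Galaxy S6 Edge 64GB",
}
candidates = lsh_candidate_pairs(products)
```

More bands with fewer rows each make the scheme more permissive (more candidate pairs, fewer misses); fewer bands with more rows make it stricter.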
  
Challenges: Pairwise Distance Computation, Match at a Store Constraint

[Diagram: Store; Product; Match]

1. Pairwise Distance Computation
2. Constrained Clustering
  
D(P, P')

1. Title BOW
2. Brand
3. Category
4. Attributes
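One way to read the list above is as the components of the distance D(P, P'). The sketch below combines them as a weighted sum: Jaccard distance on the title bag of words, exact-match checks on brand and category, and the fraction of shared attributes that disagree. The weights, the record layout, and the choice of Jaccard distance are assumptions made for illustration; the slides name the inputs but not the formula.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def product_distance(p, q, weights=(0.5, 0.2, 0.1, 0.2)):
    """Weighted distance over title BOW, brand, category and attributes."""
    w_title, w_brand, w_category, w_attrs = weights  # hypothetical weights
    title_d = jaccard_distance(
        set(p["title"].lower().split()), set(q["title"].lower().split())
    )
    brand_d = 0.0 if p["brand"].lower() == q["brand"].lower() else 1.0
    category_d = 0.0 if p["category"] == q["category"] else 1.0
    # Compare only attributes present on both records; missing data is common.
    shared = set(p["attributes"]) & set(q["attributes"])
    if shared:
        attrs_d = sum(p["attributes"][k] != q["attributes"][k] for k in shared) / len(shared)
    else:
        attrs_d = 0.0
    return (w_title * title_d + w_brand * brand_d
            + w_category * category_d + w_attrs * attrs_d)


p = {"title": "Nike Air Zoom", "brand": "Nike", "category": "Running Shoes",
     "attributes": {"color": "black", "size": "9"}}
q = {"title": "Nike Air Max", "brand": "nike", "category": "Running Shoes",
     "attributes": {"color": "black", "size": "10"}}
d = product_distance(p, q)
```

With these weights the example evaluates to roughly 0.35: the titles share two of four distinct tokens, brand and category agree, and one of the two shared attributes differs.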
  
• Constraint Type
  - Must Link
  - Cannot Link
• Examples
  - UPC
  - MPN
  - Match at a Store

[Diagram: Must Link; Cannot Link; May Link; D(P, P')]

Use Constrained Clustering
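The sketch below shows one simple form of constrained clustering: must-link pairs are merged first, then the remaining ("may link") pairs are merged greedily in order of increasing distance, skipping any merge that would put a cannot-link pair in the same cluster. It assumes that a shared UPC or MPN yields a must-link and that two different listings at the same store yield a cannot-link; that reading of "Match at a Store", the greedy strategy, and the threshold are assumptions, not the method from the talk.

```python
def constrained_clusters(items, distances, must_link, cannot_link, threshold):
    """Greedy single-link clustering that honors must-link and cannot-link pairs."""
    parent = {item: item for item in items}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def would_violate(a, b):
        # True if merging the clusters of a and b would join a cannot-link pair.
        roots = {find(a), find(b)}
        return any({find(x), find(y)} == roots for x, y in cannot_link)

    # Must-link pairs are merged unconditionally (they win over cannot-link).
    for a, b in must_link:
        parent[find(a)] = find(b)

    # May-link pairs: closest first, stop once the distance exceeds the threshold.
    for (a, b), dist in sorted(distances.items(), key=lambda kv: kv[1]):
        if dist > threshold:
            break
        if find(a) != find(b) and not would_violate(a, b):
            parent[find(a)] = find(b)

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), set()).add(item)
    return sorted(sorted(cluster) for cluster in clusters.values())


items = ["p1", "p2", "p3", "p4"]
must_link = [("p1", "p2")]      # e.g. same UPC
cannot_link = [("p2", "p3")]    # e.g. two different listings at one store
distances = {("p1", "p3"): 0.10, ("p1", "p4"): 0.15,
             ("p3", "p4"): 0.20, ("p2", "p4"): 0.90}
groups = constrained_clusters(items, distances, must_link, cannot_link, threshold=0.3)
```

In this example p3 is close to p1 and p4 but stays in its own group, because joining either would place it in the same cluster as p2.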
  
[Diagram: HTML -> Parsing -> Product Record -> Classification -> Classified Products -> Attribute Extraction -> Attributes -> Blocking -> Product Groups -> Match Inference -> Matches]
  
Precision = Correct / Reported
Recall = Correct / Actual

• Precision
  - Sample and spot-check
• Recall
  - Hard to estimate
  - Rare population
  - Manually search products on a site to produce blind sets

Lack of ground truth is the biggest roadblock.
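The two ratios can be computed directly once a set of reported matches and a set of actual matches (for example, a manually built blind set) are available. The sketch below also estimates precision from a random sample, in the spirit of "sample and spot-check". The function names, the pair representation, and the `is_correct` callback standing in for a human reviewer are assumptions for this example.

```python
import random


def precision_recall(reported, actual):
    """Precision = correct / reported, recall = correct / actual."""
    reported, actual = set(reported), set(actual)
    correct = reported & actual
    precision = len(correct) / len(reported) if reported else 0.0
    recall = len(correct) / len(actual) if actual else 0.0
    return precision, recall


def spot_check_precision(reported, is_correct, sample_size, seed=0):
    """Estimate precision by checking a random sample of reported matches."""
    rng = random.Random(seed)
    population = sorted(reported)
    sample = rng.sample(population, min(sample_size, len(population)))
    if not sample:
        return 0.0
    return sum(1 for match in sample if is_correct(match)) / len(sample)


reported = {("a", "b"), ("a", "c"), ("d", "e"), ("f", "g")}
actual = {("a", "b"), ("d", "e"), ("h", "i")}
precision, recall = precision_recall(reported, actual)
```

Precision only needs judgments on what the system reported, which is why sampling works for it. Recall needs the full set of actual matches, and since true matches are rare among all possible pairs, that set has to be built by hand.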
  
 	
  


Indix at Fifth Elephant 2015
