Nikhil Ketkar
Data Science @ Indix
16 July 2015
Crawler
Matching
Product Pages
Groups of Matching URLs
Focus of the Talk
• Competitive Landscape
  • Who are your competitors?
  • How are they pricing products?
  • What other products do they carry?
• Scale
  • Products
  • Sites
  • Categories
Store
Product
Match
Matching is central to answering key questions in retail analytics
1. Title
2. Image URL
3. Price
4. Description
5. Tables
Challenges: Scale, Depth, Diversity, Change
DOM Tree
Title or Not: Binary Classification
Class Imbalance
DOM Tree
HTML Features
Visual Features
Random Forest Model
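As a rough illustration of the "Title or Not" framing, the sketch below turns each text node of a DOM tree into one row of HTML features using only the Python standard library. The feature names (`tag`, `depth`, `n_words`, `in_heading`) and the helper `extract_rows` are assumptions made for this example, not the features used in the talk; visual features and the Random Forest training step are not shown.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

class TextNodeFeatures(HTMLParser):
    """Collect one feature row per non-empty text node in the DOM tree."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open tags from the root down to the current node
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.rows.append({
            "text": text,
            "tag": self.stack[-1] if self.stack else "",
            "depth": len(self.stack),
            "n_words": len(text.split()),
            "in_heading": any(t in ("h1", "h2") for t in self.stack),
        })

def extract_rows(html):
    """Parse an HTML string and return the list of feature rows."""
    parser = TextNodeFeatures()
    parser.feed(html)
    parser.close()
    return parser.rows
```

On a typical product page only one of many text nodes is the title, so a labelled set built from such rows is heavily skewed towards the "not title" class, which is the class imbalance noted on the slide.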
1. Title
2. Image URL
3. Price
4. Description
5. Tables
Category
Taxonomy
Challenges: Large Taxonomy, Lack of Training Data, Changes in Taxonomy
Linear SVM
CNN
Ensemble
Breadcrumb Mapping
Background Knowledge
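One simple way to picture breadcrumb mapping is a lookup from a site's breadcrumb trail to a node in the target taxonomy, backing off to shorter prefixes when the full trail is unknown. The rule table, the category names, and the function `map_breadcrumb` below are hypothetical and exist only for this sketch; the talk does not describe how its mapping is built.

```python
# Hypothetical mapping from normalized breadcrumb prefixes to taxonomy nodes.
TAXONOMY_RULES = {
    ("shoes", "running"): "Apparel > Footwear > Running Shoes",
    ("shoes",): "Apparel > Footwear",
    ("cameras",): "Electronics > Cameras",
}

# Breadcrumb entries that carry no category information (assumed list).
IGNORED = {"home", "all products"}

def map_breadcrumb(breadcrumb, rules=TAXONOMY_RULES, separator=">"):
    """Return the taxonomy node for a breadcrumb string, or None if unmapped."""
    parts = [p.strip().lower() for p in breadcrumb.split(separator)]
    parts = tuple(p for p in parts if p and p not in IGNORED)
    # Try the most specific prefix first, then back off to shorter ones.
    for end in range(len(parts), 0, -1):
        if parts[:end] in rules:
            return rules[parts[:end]]
    return None
```

A page whose breadcrumb is not covered returns `None`, which is where a learned classifier such as the Linear SVM or CNN named above would have to take over.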
1. Title
2. Image URL
3. Price
4. Description
5. Tables
Challenges: Large number of attributes, bad/missing data, variability
1. Brand
2. Size
3. Color
4. Packs
5. …
Schema
Brand:Nike
Brand:Reebok
Brand:Nike Color:Black/Neo Lime Total-Crimson
Sole:Rubber
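The step from raw spec-table rows to a fixed schema can be sketched as a normalization pass that maps site-specific attribute names onto schema names and drops missing values. The synonym table, the list of "missing" markers, and the function `normalize_attributes` are assumptions for illustration only.

```python
# Hypothetical synonyms: site-specific attribute names -> schema names.
SCHEMA_SYNONYMS = {
    "brand": "Brand", "manufacturer": "Brand", "make": "Brand",
    "color": "Color", "colour": "Color",
    "size": "Size",
    "sole": "Sole", "outsole": "Sole",
}

# Values treated as missing data (assumed list).
MISSING = {"", "n/a", "na", "-", "none"}

def normalize_attributes(raw_pairs):
    """Map (name, value) pairs from a product page onto the schema."""
    record = {}
    for key, value in raw_pairs:
        name = SCHEMA_SYNONYMS.get(key.strip().rstrip(":").lower())
        value = " ".join(value.split())  # collapse stray whitespace
        if name is None or value.lower() in MISSING:
            continue  # unknown attribute or bad/missing value
        record.setdefault(name, value)  # keep the first value seen
    return record
```

For example, the rows `("Manufacturer:", " Nike ")`, `("Colour", "Black")`, and `("Outsole", "Rubber")` come out as `Brand`, `Color`, and `Sole` entries, while a `Size` of `N/A` is dropped.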
1. Title
2. Image URL
3. Price
4. Description
5. Tables
Challenges: No single approach works well
1. Brand
2. Size
3. Color
4. Packs
5. …
Category
Enriched Product Record
Merge Groups
Bucketing/Clustering
Mass Join LSH
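To make the LSH bucketing idea concrete, here is a minimal MinHash sketch in pure Python: products whose title token sets are similar tend to land in at least one common bucket, so only pairs within a bucket need an exact distance computation. The choice of MinHash over word tokens, the MD5-based hash family, and the parameters `num_hashes` and `bands` are assumptions for this example, not details from the talk.

```python
import hashlib
from collections import defaultdict

def shingles(title):
    """Set of lowercase word tokens used as the set representation."""
    return set(title.lower().split())

def minhash(tokens, num_hashes=16):
    """One minimum per seeded hash function; similar sets share many minima."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return signature

def lsh_buckets(titles, num_hashes=16, bands=8):
    """Split each signature into bands and bucket items by band content."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for idx, title in enumerate(titles):
        tokens = shingles(title)
        if not tokens:
            continue  # nothing to hash for an empty title
        sig = minhash(tokens, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(idx)
    return buckets

def candidate_pairs(titles, **kwargs):
    """All index pairs that share at least one bucket."""
    pairs = set()
    for members in lsh_buckets(titles, **kwargs).values():
        ordered = sorted(members)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs
```

More bands with fewer rows each make the scheme more permissive (higher recall, more candidate pairs); fewer, wider bands make it stricter.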
Challenges: Pairwise Distance Computation, Match at a Store Constraint
Store
Product
Match
1. Pairwise Distance Computation
2. Constrained Clustering
1. Title BOW
2. Brand
3. Category
4. Attributes
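A pairwise distance over the four signals listed above could look like the following sketch: Jaccard distance on the title bag of words, plus mismatch terms for brand, category, and shared attributes. The weights, the 0.5 penalty for missing values, and the record layout (`title`, `brand`, `category`, `attributes` keys) are assumptions chosen for illustration.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def mismatch(a, b):
    """0.0 if two known values agree, 1.0 if they differ, 0.5 if either is missing."""
    if not a or not b:
        return 0.5
    return 0.0 if a.lower() == b.lower() else 1.0

def attribute_distance(a, b):
    """Average mismatch over the attribute names both records share."""
    shared = set(a) & set(b)
    if not shared:
        return 0.5
    return sum(mismatch(a[k], b[k]) for k in shared) / len(shared)

# Assumed weights; in practice these would be tuned or learned.
WEIGHTS = {"title": 0.4, "brand": 0.2, "category": 0.2, "attributes": 0.2}

def product_distance(p, q, weights=WEIGHTS):
    """Weighted sum of the per-signal distances, in the range [0, 1]."""
    parts = {
        "title": jaccard_distance(p["title"].lower().split(),
                                  q["title"].lower().split()),
        "brand": mismatch(p.get("brand"), q.get("brand")),
        "category": mismatch(p.get("category"), q.get("category")),
        "attributes": attribute_distance(p.get("attributes", {}),
                                         q.get("attributes", {})),
    }
    return sum(weights[k] * parts[k] for k in weights)
```

Computing this for every pair is quadratic in the number of products, which is why the pairwise computation is listed as a challenge and why blocking is applied first.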
• Constraint Type
  • Must Link
  • Cannot Link
• Examples
  • UPC
  • MPN
  • Match at a Store
Must Link
Cannot Link
May Link
Use Constrained Clustering
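The three link types can be sketched with a simple greedy procedure: must-link pairs (for example, a shared UPC or MPN) are merged first, and the remaining may-link pairs are merged closest-first unless the merge would put a cannot-link pair (for example, two products from the same store) into one group. This is a simplified single-linkage stand-in for constrained clustering, written for illustration; the function name, its parameters, and the threshold are assumptions, and the talk does not specify which algorithm is used.

```python
from itertools import combinations

def constrained_clusters(n, distance, must_link, cannot_link, threshold):
    """Group items 0..n-1 under must-link and cannot-link constraints."""
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forbidden = {frozenset(pair) for pair in cannot_link}

    def violates(ra, rb):
        # True if merging the two groups would join a cannot-link pair.
        members_a = [i for i in range(n) if find(i) == ra]
        members_b = [i for i in range(n) if find(i) == rb]
        return any(frozenset((i, j)) in forbidden
                   for i in members_a for j in members_b)

    # Must-link pairs are merged unconditionally (conflicts are not checked).
    for a, b in must_link:
        parent[find(a)] = find(b)

    # May-link pairs: merge closest first, skipping merges that break a constraint.
    may_link = sorted((distance(a, b), a, b)
                      for a, b in combinations(range(n), 2))
    for d, a, b in may_link:
        if d > threshold:
            break
        ra, rb = find(a), find(b)
        if ra != rb and not violates(ra, rb):
            parent[ra] = rb

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Because the merge order is greedy, an early merge can block a later, equally plausible one; a production system would likely need a more careful treatment of conflicting constraints.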
Parsing Classification
Attribute
Extraction
Blocking
Match
Inference
HTML
Product
Record
Classified
Products
Attributes
Product
Groups
Matches
[Figure: Venn diagrams of Reported, Actual, and Correct matches]
• Precision
  • Sample and spot-check
• Recall
  • Hard to estimate
  • Rare population
  • Manually search products on a site to produce blind sets
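The two estimates can be written down as a short sketch: precision from a random sample of reported matches that a reviewer spot-checks, and recall from a manually built blind set of true matches. The function names, the `is_correct` callback (standing in for the human reviewer), and the representation of a match as a tuple are assumptions for illustration.

```python
import random

def estimate_precision(reported_matches, is_correct, sample_size, seed=0):
    """Spot-check a random sample of reported matches.

    Returns correct / sampled, or None when there is nothing to sample.
    """
    population = list(reported_matches)
    rng = random.Random(seed)
    sample = rng.sample(population, min(sample_size, len(population)))
    if not sample:
        return None
    return sum(1 for match in sample if is_correct(match)) / len(sample)

def estimate_recall(blind_set, reported_matches):
    """Fraction of manually found true matches that were also reported.

    Matches must use the same canonical form (e.g. ordered tuples) in both inputs.
    """
    truth = set(blind_set)
    if not truth:
        return None
    return len(truth & set(reported_matches)) / len(truth)
```

Both numbers are only as good as the sample and the blind set behind them, which is the ground-truth problem raised on the next slide.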
Lack of Ground Truth is the biggest roadblock
Indix5thElephantDraft2