Automated Product Data Preparation: Processes, Methods and Algorithms
4th E-Commerce Meetup @ Liip
“Big Data is not the new oil. Small and messy data, properly refined, is the new fuel: consumable data”
Tobias Widmer
CTO @ Onedot.com
We are hiring!
How do you decide where to buy online?
Challenges of Product Data
• Some manufacturers and data suppliers are still in the ERP era
• Missing universal and globally accepted standards and taxonomies
• Complex network of suppliers, distributors, consumers and publishers
• Extreme variety, from food to consumer electronics to construction supplies
• Product data feeds are often inconsistent, incomplete, corrupt, or all of them
• There is no one-stop shop to get good product data from across categories
• Negotiations with data suppliers regarding data quality are cumbersome
Product data is one of the most complex kinds of data to handle.
A new approach to tame product data is required:
1. Stop relying on quasi-standards and embrace the mess out there
2. Take whatever data you can from different data sources
3. Have a reliable and up-to-date product data model on your end
4. Automate data integration and harmonisation to the maximum
Introducing the Product Data Model (PDM)
A PDM simplifies governance and enables automation.
1. A PDM summarizes which products are described with which attributes.
2. A PDM contains the values needed to provide consistent search filters for end-users.
3. A PDM indicates which information suppliers should provide.
4. A PDM helps to set up a comprehensive product data governance with clear processes.
5. A PDM is required for automated product data preparation.
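To make the idea of a PDM more concrete, here is a minimal sketch of how one category of such a model could be represented and used to check supplier records. The class names (`AttributeSpec`, `CategoryModel`), the `validate` method and the chosen attributes are assumptions for illustration only; the slides do not describe how the PDM is stored.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeSpec:
    """One attribute of a product category (hypothetical structure)."""
    name: str
    required: bool = False                    # must suppliers provide it?
    allowed_values: frozenset = frozenset()   # empty set means free text


@dataclass
class CategoryModel:
    """Which attributes describe the products of one category."""
    category: str
    attributes: dict = field(default_factory=dict)

    def add(self, spec: AttributeSpec) -> None:
        self.attributes[spec.name] = spec

    def validate(self, product: dict) -> list:
        """Return a list of human-readable problems for one product record."""
        problems = []
        for spec in self.attributes.values():
            value = product.get(spec.name)
            if value is None:
                if spec.required:
                    problems.append(f"missing required attribute: {spec.name}")
                continue
            if spec.allowed_values and value not in spec.allowed_values:
                problems.append(f"unexpected value for {spec.name}: {value}")
        return problems


# Illustrative category; attribute names and values are assumptions.
notebook = CategoryModel("Notebook")
notebook.add(AttributeSpec("Colour", required=True,
                           allowed_values=frozenset({"Champagne", "Silver", "Blue"})))
notebook.add(AttributeSpec("Storage Capacity", required=True))
notebook.add(AttributeSpec("Usage",
                           allowed_values=frozenset({"Business", "Consumer", "Gaming"})))

# A supplier record with a non-normalised colour and a missing attribute.
issues = notebook.validate({"Colour": "Silber", "Usage": "Gaming"})
for issue in issues:
    print(issue)
```

The allowed values double as the consistent filter values for end-users, and the `required` flag expresses what suppliers should deliver.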
Automating Data Preparation
General data preparation in data science is an iterative process with lots of potential to automate.
[Figure: data preparation process]
Process steps: Identify Sources, Profile Data, Clean Data, Unify Data, Schema Mapping, Entity Resolution, Segmentation, Post-Processing
Potential for automation: Low, Medium, High, Highest
Methods: Data Mining, Probabilistic Methods, Machine Learning, Artificial Intelligence
Going back to University
Automated data preparation pipelines combine probabilistic methods with advanced machine learning algorithms.
The Case for Machine Intelligence
Probabilistic and statistical methods embrace the fuzziness of product data and allow continuous automated learning.
Past: Manual
• Slow and error-prone
• Limited scalability
• Time-consuming education
Present: Rule-Based
• Costly implementation
• Expensive maintenance
• No automated learning
Future: Artificial Intelligence
• Minimal training using data
• Copes with changing data structure
• Continuous, automated learning
Product Data Preparation
6 disciplines for successful product data preparation.
Integration
• Automated Schema Mapping: Integrate new products independent of attribute names
• Golden Record Creation: Combine different data records without the need for a primary key
Transformation
• Attribute Normalization: Adapt supplier data to your conventions, formats and styles
• Attribute Extraction: Extract attributes from product names and descriptions
Categorisation
• Product Grouping: Group products according to your classifications and dynamic groups
• Product Variants Identification: Identify product variants across categories and product groups
Schema Mapping
Translate schema mapping into a large-scale segmentation problem using linear-time blocking algorithms.
Input Schema | Target Schema | Confidence Level | Top 3 Values
address->postcode | Zip | 98% | CF11 8TW, SN5 8YW, BS10 7UH
model Variant | Model Variant | 96% | D V8 S TIPTRONIC S, Diesel Saloon, Coupe GT4
seatsNumber | Number of Seats | 96% | 5, 4, 2
brandColor | Exterior Brand Colour | 95% | White, Rosso Corsa, Jet Black Metallic
m_Unit | Mileage Unit | 94% | miles, km
carOptions | Car Options | 92% | 21-inch 911 Turbo II wheel wit…, 20-inch RS Spyder Deisgn wheel…
Type | Model | 90% | Cayenne, 911, Ghibli
car category | Car Type | 89% | g_km_emission
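The following is a small sketch of value-based schema matching with a blocking step, to illustrate why attribute names do not have to agree. It is not the algorithm from the talk: the blocking key (dominant value shape), the Jaccard score and all function names are assumptions, and the target values are made up around the examples in the table.

```python
from collections import defaultdict


def shape(value: str) -> str:
    """Cheap blocking key: coarse shape of a single value."""
    stripped = value.replace(" ", "")
    if stripped.isdigit():
        return "numeric"
    if stripped.isalpha():
        return "alpha"
    return "mixed"


def block_key(values: list) -> str:
    """Most common value shape in a column (ties broken alphabetically)."""
    counts = defaultdict(int)
    for v in values:
        counts[shape(v)] += 1
    return max(sorted(counts), key=counts.get)


def jaccard(a: set, b: set) -> float:
    """Overlap of two value sets, between 0.0 and 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0


def map_schema(source: dict, target: dict) -> dict:
    """Map each source column to the best target attribute in its block."""
    blocks = defaultdict(list)
    for name, values in target.items():      # one pass over the target schema
        blocks[block_key(values)].append(name)

    mapping = {}
    for name, values in source.items():      # one pass over the source schema
        candidates = blocks.get(block_key(values), [])
        scored = [(jaccard(set(values), set(target[c])), c) for c in candidates]
        if scored:
            score, best = max(scored)
            mapping[name] = (best, round(score, 2))
    return mapping


# Illustrative data; the target values are assumptions.
target = {
    "Number of Seats": ["5", "4", "2", "7"],
    "Mileage Unit": ["miles", "km"],
    "Model": ["Cayenne", "911", "Ghibli", "Macan"],
}
source = {
    "seatsNumber": ["5", "4", "2"],
    "m_Unit": ["miles", "km"],
    "Type": ["Cayenne", "911", "Ghibli"],
}
mapping = map_schema(source, target)
print(mapping)
```

Assigning columns to blocks takes one pass over each schema; the more expensive pairwise comparison then only runs inside a block, which is the point of blocking.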
Golden Record Generation
Using entity resolution and data fusion algorithms based on relevant weighted attributes to unify product information.
Fusion strategies: Aggregation; Authority; Min, Max, Avg., …
Input records:
Source | Name | Colour | Size | Description | ...
Source 1 | T-shirt | Red | XS | This product is very light |
Source 2 | T-shirt | | S | Light product |
Fused records:
Name | Colour | Size | Description | ...
T-shirt | Red | S | This product is very light |
T-shirt | Red | XS, S | This product is very light, Light product |
T-shirt | Red | XS | This product is very light |
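As an illustration of the fusion strategies named above, the sketch below fuses the two T-shirt records attribute by attribute. The function names, the source ranking and the size ordering are assumptions; the entity resolution step that decides the two records describe the same product is taken as given.

```python
def fuse_aggregate(records, attribute):
    """Aggregation: keep every distinct non-empty value, in source order."""
    seen = []
    for record in records:
        value = record.get(attribute)
        if value and value not in seen:
            seen.append(value)
    return ", ".join(seen)


def fuse_authority(records, attribute, ranking):
    """Authority: take the value from the most trusted source that has one."""
    for source in ranking:
        for record in records:
            if record["source"] == source and record.get(attribute):
                return record[attribute]
    return ""


def fuse_max(records, attribute, key):
    """Min/Max-style strategy: pick the value that maximises `key`."""
    values = [r[attribute] for r in records if r.get(attribute)]
    return max(values, key=key) if values else ""


ATTRIBUTES = ["Name", "Colour", "Size", "Description"]
records = [
    {"source": "Source 1", "Name": "T-shirt", "Colour": "Red", "Size": "XS",
     "Description": "This product is very light"},
    {"source": "Source 2", "Name": "T-shirt", "Colour": "", "Size": "S",
     "Description": "Light product"},
]

aggregated = {a: fuse_aggregate(records, a) for a in ATTRIBUTES}
# Assumed ranking: Source 1 is trusted more than Source 2.
authority = {a: fuse_authority(records, a, ["Source 1", "Source 2"])
             for a in ATTRIBUTES}
# Assumed ordering of sizes, used to pick the largest one.
SIZE_ORDER = ["XS", "S", "M", "L", "XL"]
largest_size = fuse_max(records, "Size", key=SIZE_ORDER.index)

print(aggregated)
print(authority)
print(largest_size)
```

In practice a different strategy can be chosen per attribute, for example authority for the name and aggregation for the description.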
Attribute Normalisation
Algorithms based on independent Bayes networks and word embedding classify arbitrary colours into colour families.
Input: Supplier free-text colours
natur beige, aquamarine, olive creative, Swiss Cow style, Manhattan Grau, elegant gold
Output: Normalised colour families
Colour | Colour Family
natur beige | Light Brown
aquamarine | Blue
olive creative | Dark Green
egg yolk | Yellow
Manhattan Grau | Grey
elegant gold | Gold
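A very reduced stand-in for this kind of classifier is a multinomial naive Bayes model over word tokens, sketched below. It leaves out the word-embedding part entirely, and the training examples are invented for illustration; only the colour families are taken from the table.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over lowercase word tokens, Laplace-smoothed."""

    def __init__(self):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, examples):
        for text, label in examples:
            self.class_counts[label] += 1
            for token in text.lower().split():
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)                     # class prior
            denominator = (sum(self.token_counts[label].values())
                           + len(self.vocabulary))
            for token in text.lower().split():
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                score += math.log(
                    (self.token_counts[label][token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: a few labelled supplier colours per family.
training = [
    ("beige", "Light Brown"), ("sand beige", "Light Brown"),
    ("aquamarine", "Blue"), ("navy blue", "Blue"),
    ("olive", "Dark Green"), ("forest green", "Dark Green"),
    ("grau", "Grey"), ("grey", "Grey"),
    ("gold", "Gold"), ("rose gold", "Gold"),
]
model = NaiveBayes().fit(training)

inputs = ["natur beige", "aquamarine", "olive creative",
          "Manhattan Grau", "elegant gold"]
families = {colour: model.predict(colour) for colour in inputs}
print(families)
```

Marketing words such as "natur", "creative" or "elegant" are unseen tokens here, so they are smoothed away and the known colour word decides the family.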
Attribute Extraction
Extract product attributes from unstructured text and put them into the target data schema using deep learning.
Input: Supplier product catalogues
Product Description
Apple iPhone 5S 32GB Silver
Apple iPhone 7 64gb gold
Apple iPhone 7 128 Space grey
Apple iPhone 8 64 gigabyte goold
Apple iPhone 8 256 Gb silber
Apple iPhone eight 256 giga-byte Gold
Output: Extracted product attributes
Product Name | Model | Storage (GB) | Colour
Apple iPhone | 5S | 32 | Silver
Apple iPhone | 7 | 64 | Gold
Apple iPhone | 7 | 128 | Space Grey
Apple iPhone | 8 | 64 | Gold
Apple iPhone | 8 | 256 | Silver
Apple iPhone | 8 | 256 | Gold
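To show what the extraction has to cope with, here is a deliberately simple rule-based sketch that handles exactly the six descriptions above. It uses a regular expression and hand-written lookup tables instead of deep learning, so it is a baseline for comparison and not the approach from the talk; the pattern and the tables are assumptions.

```python
import re

# Hypothetical lookup tables; a learning system would derive these from data.
COLOURS = {"silver": "Silver", "silber": "Silver", "gold": "Gold",
           "goold": "Gold", "space grey": "Space Grey"}
NUMBER_WORDS = {"eight": "8"}

PATTERN = re.compile(
    r"^Apple iPhone (?P<model>\w+) (?P<storage>\d+)\s*"
    r"(?:gb|gigabyte|giga-byte)?\s*(?P<colour>.+)$",
    re.IGNORECASE,
)


def extract(description):
    """Return model, storage and colour, or None if the pattern fails."""
    match = PATTERN.match(description.strip())
    if match is None:
        return None
    model = match["model"]
    model = NUMBER_WORDS.get(model.lower(), model.upper())
    colour = match["colour"].strip().lower()
    return {
        "Model": model,
        "Storage (GB)": int(match["storage"]),
        "Colour": COLOURS.get(colour, colour.title()),
    }


descriptions = [
    "Apple iPhone 5S 32GB Silver",
    "Apple iPhone 7 64gb gold",
    "Apple iPhone 7 128 Space grey",
    "Apple iPhone 8 64 gigabyte goold",
    "Apple iPhone 8 256 Gb silber",
    "Apple iPhone eight 256 giga-byte Gold",
]
extracted = [extract(d) for d in descriptions]
for row in extracted:
    print(row)
```

Every new spelling, unit or language needs another rule or table entry here, which is the maintenance cost that a learned extractor is meant to avoid.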
Product Categorisation
Using hierarchical, independent Bayes networks to categorise products in a product category tree.
SKU ID | Product Description | Predicted Category | Confidence Level
21045696 | Sanitär-Kreuzschlüssel 200x200mm Chrom | Screwdrivers | 95%
21045696 | Sanitär-Kreuzschlüssel 200x200mm Chrom | Screwdrivers | 95%
10031686 | Bosch Rollenauflage PTA 1000… | Filling Stations | 93%
10031686 | Bosch Rollenauflage PTA 1000… | Filling Stations | 93%
10031686 | Bosch Rollenauflage PTA 1000… | Filling Stations | 93%
21023947 | Mehrzweckhalter 3tlg. verstellb. m. 12Klammer | Mounting | 86%
94905095 | Präz.-Zentrierg. Centro 6-125mm Halmer… | Drilling | 95%
94905095 | Präz.-Zentrierg. Centro 6-125mm Halmer | Drilling | 95%
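One simple way to picture hierarchical categorisation is a top-down walk through the category tree, where each node picks its best child with a small naive Bayes score. The sketch below does that on an invented two-level tree: the top-level category names and all training texts are assumptions, only the leaf category names come from the table, and the real system is likely more elaborate.

```python
import math
from collections import Counter

# Hypothetical category tree; each leaf holds a few labelled training texts.
TREE = {
    "Hand Tools": {
        "Screwdrivers": ["kreuzschlüssel chrom", "schraubendreher set"],
        "Mounting": ["mehrzweckhalter klammer", "wandhalter verstellbar"],
    },
    "Power Tool Accessories": {
        "Drilling": ["zentriergerät bohrer", "bohrer set metall"],
        "Filling Stations": ["rollenauflage bosch", "werkstückauflage"],
    },
}


def leaf_texts(node):
    """All training texts below a node (a leaf is a list of texts)."""
    if isinstance(node, list):
        return node
    return [t for child in node.values() for t in leaf_texts(child)]


def choose_child(node, tokens):
    """Pick the child with the highest smoothed naive Bayes score."""
    counts = {name: Counter(w for t in leaf_texts(child) for w in t.split())
              for name, child in node.items()}
    vocabulary = {w for c in counts.values() for w in c}
    docs = {name: len(leaf_texts(child)) for name, child in node.items()}
    total_docs = sum(docs.values())

    def score(name):
        denominator = sum(counts[name].values()) + len(vocabulary)
        return math.log(docs[name] / total_docs) + sum(
            math.log((counts[name][w] + 1) / denominator) for w in tokens)

    return max(node, key=score)


def categorise(tree, description):
    """Walk from the root to a leaf, choosing one child per level."""
    tokens = description.lower().split()
    path, node = [], tree
    while isinstance(node, dict):
        name = choose_child(node, tokens)
        path.append(name)
        node = node[name]
    return path


path_drill = categorise(TREE, "Bohrer Set Holz")
path_screw = categorise(TREE, "Schraubendreher Chrom")
print(path_drill)
print(path_screw)
```

Each level only has to separate a handful of sibling categories, which keeps the individual classifiers small even when the full tree is large.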
Product Variant Identification
Generating product variants by segmenting product groups and aggregating the resulting product group attributes.
Input: Supplier product catalogue
Category | ID | Manufacturer | Name
Notebook | 422102 | Lenovo | Lenovo Notebook ThinkPad Yoga 900-13
Notebook | 422101 | Lenovo | Lenovo Notebook ThinkPad Yoga 900-13
Notebook | 376675 | Lenovo | Lenovo Notebook Yoga 900-13 Silber
Notebook | 370921 | Lenovo | Lenovo Notebook Yoga 900-13 Silber
Output: Product variants
Category | Attribute | Examples
Notebook | Colour | Champagne, Silver, Blue
Notebook | Storage Capacity | 128 GB, 256 GB, 512 GB
Notebook | Display Resolution | 1366x768 (WXGA), 1440x900 (WXGA+)
Notebook | Usage | Business, Consumer, Gaming
Notebook | Connectivity | 4G, WiFi, Bluetooth
Notebook | Processor Family | Intel Core i5, Intel Core i7
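The grouping step can be sketched as follows: strip the tokens that only distinguish variants from the product name, then group by what remains. The token table and function names are assumptions, and a real pipeline would learn which attributes are variant-defining per category instead of using a fixed list.

```python
from collections import defaultdict

# Hypothetical tokens that only distinguish variants (here: colours).
VARIANT_TOKENS = {
    "silber": ("Colour", "Silver"),
    "silver": ("Colour", "Silver"),
    "blau": ("Colour", "Blue"),
    "champagne": ("Colour", "Champagne"),
}


def split_name(name):
    """Separate a product name into its base name and variant attributes."""
    base, attributes = [], {}
    for token in name.split():
        hit = VARIANT_TOKENS.get(token.lower())
        if hit:
            attributes[hit[0]] = hit[1]
        else:
            base.append(token)
    return " ".join(base), attributes


def group_variants(products):
    """Group products that share category, manufacturer and base name."""
    groups = defaultdict(list)
    for product in products:
        base, attributes = split_name(product["Name"])
        key = (product["Category"], product["Manufacturer"], base)
        groups[key].append({"ID": product["ID"], **attributes})
    return dict(groups)


products = [
    {"Category": "Notebook", "ID": "422102", "Manufacturer": "Lenovo",
     "Name": "Lenovo Notebook ThinkPad Yoga 900-13"},
    {"Category": "Notebook", "ID": "422101", "Manufacturer": "Lenovo",
     "Name": "Lenovo Notebook ThinkPad Yoga 900-13"},
    {"Category": "Notebook", "ID": "376675", "Manufacturer": "Lenovo",
     "Name": "Lenovo Notebook Yoga 900-13 Silber"},
    {"Category": "Notebook", "ID": "370921", "Manufacturer": "Lenovo",
     "Name": "Lenovo Notebook Yoga 900-13 Silber"},
]
groups = group_variants(products)
for key, members in groups.items():
    print(key, members)
```

The attributes collected per group are what would be aggregated into the per-category variant attributes shown in the output table.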
Conclusion
• Using a Product Data Model (PDM) is essential for automated product data preparation
• Adequate data governance helps keep the PDM in shape
• Most steps in product data preparation can be automated using probabilistic and statistical approaches
• Advanced machine learning techniques allow a data pipeline to adapt to changing data and learn from user feedback
We have shown a few ideas to make automated data preparation a time- and cost-saving reality.
Q&A
onedot.com

Automated Product Data Preparation: Processes, Methods and Algorithms

  • 1. Processes, Methods and Algorithms 4. E-Commerce Meetup @ Liip Automated Product Data Preparation
  • 2. “Big Data is not the new oil. Small and messy data, properly refined, is the new fuel: consumable data” Tobias Widmer CTO @ Onedot.com We are hiring!
  • 3. How do you decide where to buy online?
  • 4. Challenges of Product Data • Some manufacturers and data suppliers are still in the ERP era • Missing universal and globally accepted standards and taxonomies • Complex network of suppliers, distributors and consumers and publishers • Extreme variety, from food to consumer electronics to construction supplies • Product data feeds are often inconsistent, incomplete, corrupt, or all of of them • There is no one-stop shop to get good product data from across categories • Negotiations with data suppliers regarding data quality are cumbersome Product data is one of the most complex kind of data to handle. A new approach to tame product data is required: 1. Stop relying on quasi-standards and embrace the mess out there 2. Take whatever data you can from different data sources 3. Have a reliable and up-to-date product data model on your end 4. Automated data integration and harmonisation to the maximum
  • 5. Introducing Product Data Model
      A PDM simplifies governance and enables automation.
      1. A PDM summarizes which products are described with which attributes.
      2. A PDM contains the values needed to provide consistent search filters for end-users.
      3. A PDM indicates which information suppliers should provide.
      4. A PDM helps to set up a comprehensive product data governance with clear processes.
      5. A PDM is required for automated product data preparation.
  • 6. Automating Data Preparation
      General data preparation in data science is an iterative process with lots of potential to automate.
      [Diagram: Data Preparation Process]
      Process steps: Identify Sources, Profile Data, Clean Data, Unify Data, Schema Mapping, Entity Resolution, Segmentation, Post-Processing
      Potential for Automation: Low, Medium, High, Highest
      Methods: Data Mining, Probabilistic Methods, Machine Learning, Artificial Intelligence
  • 7. Going back to University Automated data preparation pipelines combine probabilistic methods with advanced machine learning algorithms.
  • 8. The Case for Machine Intelligence
      Probabilistic and statistical methods embrace the fuzziness of product data and allow continuous automated learning.
      Past: Manual
        • Slow and error-prone
        • Limited scalability
        • Time-consuming education
      Present: Rule-Based
        • Costly implementation
        • Expensive maintenance
        • No automated learning
      Future: Artificial Intelligence
        • Minimal training using data
        • Copes with changing data structure
        • Continuous, automated learning
  • 9. Product Data Preparation
      6 disciplines for successful product data preparation.
      Integration
        • Automated Schema Mapping: Integrate new products independent of attribute names
        • Golden Record Creation: Combine different data records without the need for a primary key
      Transformation
        • Attribute Normalization: Adapt supplier data to your conventions, formats and styles
        • Attribute Extraction: Extract attributes from product names and descriptions
      Categorisation
        • Product Grouping: Group products according to your classifications and dynamic groups
        • Product Variants Identification: Identify product variants across categories and product groups
  • 10. Schema Mapping
      Translate schema mapping into a large-scale segmentation problem using linear-time blocking algorithms.
      Input Schema | Target Schema | Confidence Level | Top 3 Values
      address->postcode | Zip | 98% | CF11 8TW, SN5 8YW, BS10 7UH
      model Variant | Model Variant | 96% | D V8 S TIPTRONIC S, Diesel Saloon, Coupe GT4
      seatsNumber | Number of Seats | 96% | 5, 4, 2
      brandColor | Exterior Brand Colour | 95% | White, Rosso Corsa, Jet Black Metallic
      m_Unit | Mileage Unit | 94% | miles, km
      carOptions | Car Options | 92% | 21-inch 911 Turbo II wheel wit…, 20-inch RS Spyder Deisgn wheel…
      Type | Model | 90% | Cayenne, 911, Ghibli
      car category | Car Type | 89% | g_km_emission
  • 11. Golden Record Generation
      Using entity resolution and data fusion algorithms based on relevant weighted attributes to unify product information.
      Panel labels: Source 1, Source 2, Strategy: Aggregation, Strategy: Authority, Strategy: Min, Max, Avg.,…
      Records, in the order they appear on the slide:
      Name | Colour | Size | Description | ...
      T-shirt | Red | XS | This product is very light | E
      T-shirt | | S | Light product | E
      T-shirt | Red | S | This product is very light | E
      T-shirt | Red | XS, S | This product is very light, Light product | E
      T-shirt | Red | XS | This product is very light | E
  • 12. Attribute Normalisation
      Algorithms based on independent Bayes networks and word embedding classify arbitrary colours into colour families.
      Input: Supplier free-text colours
      Colour
      natur beige
      aquamarine
      olive creative
      Swiss Cow style
      Manhattan Grau
      elegant gold
      Output: Normalised colour families
      Colour | Colour Family
      natur beige | Light Brown
      aquamarine | Blue
      olive creative | Dark Green
      egg yolk | Yellow
      Manhattan Grau | Grey
      elegant gold | Gold
  • 13. Attribute Extraction
      Extract product attributes from unstructured text and put them into the target data schema using deep learning.
      Input: Supplier product catalogues
      Product Description
      Apple iPhone 5S 32GB Silver
      Apple iPhone 7 64gb gold
      Apple iPhone 7 128 Space grey
      Apple iPhone 8 64 gigabyte goold
      Apple iPhone 8 256 Gb silber
      Apple iPhone eight 256 giga-byte Gold
      Output: Extracted product attributes
      Product Name | Model | Storage (GB) | Colour
      Apple | iPhone 5S | 32 | Silver
      Apple | iPhone 7 | 64 | Gold
      Apple | iPhone 7 | 128 | Space Grey
      Apple | iPhone 8 | 64 | Gold
      Apple | iPhone 8 | 256 | Silver
      Apple | iPhone 8 | 256 | Gold
  • 14. Product Categorisation
      Using hierarchical, independent Bayes networks to categorise products in a product category tree.
      SKU ID | Product Description | Predicted Category | Confidence Level
      21045696 | Sanitär-Kreuzschlüssel 200x200mm Chrom | Screwdrivers | 95%
      21045696 | Sanitär-Kreuzschlüssel 200x200mm Chrom | Screwdrivers | 95%
      10031686 | Bosch Rollenauflage PTA 1000… | Filling Stations | 93%
      10031686 | Bosch Rollenauflage PTA 1000… | Filling Stations | 93%
      10031686 | Bosch Rollenauflage PTA 1000… | Filling Stations | 93%
      21023947 | Mehrzweckhalter 3tlg. verstellb. m. 12Klammer | Mounting | 86%
      94905095 | Präz.-Zentrierg. Centro 6-125mm Halmer… | Drilling | 95%
      94905095 | Präz.-Zentrierg. Centro 6-125mm Halmer | Drilling | 95%
  • 15. Product Variant Identification
      Generating product variants by segmenting product groups and aggregating the resulting product group attributes.
      Input: Supplier product catalogue
      Category | ID | Manufacturer | Name
      Notebook | 422102 | Lenovo | Lenovo Notebook ThinkPad Yoga 900-13
      Notebook | 422101 | Lenovo | Lenovo Notebook ThinkPad Yoga 900-13
      Notebook | 376675 | Lenovo | Lenovo Notebook Yoga 900-13 Silber
      Notebook | 370921 | Lenovo | Lenovo Notebook Yoga 900-13 Silber
      Output: Product variants
      Category | Attribute | Examples
      Notebook | Colour | Champagne, Silver, Blue
      Notebook | Storage Capacity | 128 GB, 256 GB, 512 GB
      Notebook | Display Resolution | 1366x768 (WXGA), 1440x900 (WXGA+)
      Notebook | Usage | Business, Consumer, Gaming
      Notebook | Connectivity | 4G, WiFi, Bluetooth
      Notebook | Processor Family | Intel Core i5, Intel Core i7
  • 16. Conclusion
      • Using a Product Data Model (PDM) is essential for automated product data preparation
      • Adequate data governance helps keep the PDM in shape
      • Most steps in product data preparation can be automated using probabilistic and statistical approaches
      • Advanced machine learning techniques allow a data pipeline to adapt to changing data and learn from user feedback
      We have shown a few ideas to make automated data preparation a time- and cost-saving reality.
  • 17. Q&A
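
The fusion strategies named on slide 11 (Golden Record Generation) can be illustrated with a minimal Python sketch. It is not the speaker's implementation: the function names `fuse_aggregation` and `fuse_authority` are hypothetical, and the sketch assumes entity resolution has already decided that both records describe the same product. It also assumes that "Aggregation" keeps every distinct non-empty value per attribute and that "Authority" prefers the most trusted source that has a value; the slide names the strategies without defining them.

```python
from typing import Dict, List, Optional

# A record maps attribute names to values; None marks a missing value.
Record = Dict[str, Optional[str]]


def _attributes(records: List[Record]) -> List[str]:
    """All attribute names, in first-seen order, without duplicates."""
    return list(dict.fromkeys(key for rec in records for key in rec))


def fuse_aggregation(records: List[Record], sep: str = ", ") -> Record:
    """Assumed 'Aggregation' strategy: join all distinct non-empty values."""
    fused: Record = {}
    for attr in _attributes(records):
        values: List[str] = []
        for rec in records:
            value = rec.get(attr)
            if value and value not in values:
                values.append(value)
        fused[attr] = sep.join(values) if values else None
    return fused


def fuse_authority(records: List[Record]) -> Record:
    """Assumed 'Authority' strategy: records are ordered most trusted first;
    take the first non-empty value for each attribute."""
    return {
        attr: next((rec[attr] for rec in records if rec.get(attr)), None)
        for attr in _attributes(records)
    }


# The two source records from slide 11 (Source 2 has no colour).
source_1: Record = {
    "Name": "T-shirt",
    "Colour": "Red",
    "Size": "XS",
    "Description": "This product is very light",
}
source_2: Record = {
    "Name": "T-shirt",
    "Colour": None,
    "Size": "S",
    "Description": "Light product",
}

aggregated = fuse_aggregation([source_1, source_2])
authoritative = fuse_authority([source_1, source_2])
```

With Source 1 listed first, `aggregated` carries the size "XS, S" and both descriptions joined by a comma, while `authoritative` equals the Source 1 record. Passing the sources in the opposite order to `fuse_authority` would prefer Source 2 and fall back to Source 1 only for the missing colour.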