Potters wheel


dhruvgairola


Potter's Wheel: An Interactive Data Cleaning System
(Raman and Hellerstein, Proc. VLDB, 2001)

Outline
• Problem
• Potter’s wheel
• Architecture
• Discrepancy detection
• Structures
• MDL metric
• Interactive Transformations
• Splitting
• Evaluation
• Conclusion
• Discussion

Problem
• Data cleaning is an important process.
• Cleaning involves data auditing and data transformations.
• Current solutions (ETL tools and reengineering tools) :
• Iterative.
• Not interactive.
• Long wait times.

Potter’s wheel
• Integrates transformation and discrepancy detection.
• Interactive transformations.
• Reduced wait times.

Architecture
• 4 parts :
• Data source (tabular, not nested)
• Online reorderer (spreadsheet, sorting, dynamic display)
• Automatic discrepancy detector (runs in background)
• Transformation engine (applies transforms immediately and in the background)

Discrepancy detection
• Performed in the background automatically.
• Done by finding suitable structures :
• Structure is a string of domains.
• Custom domains can be defined.
• Find records that do not fit the structure.
• Structures can be parameterized :
• Can use statistics to compute anomalies.
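
A minimal sketch of the idea above, assuming hypothetical domains expressed as regular expressions. The names DOMAINS, matches_structure, and find_discrepancies are illustrative and are not part of Potter's Wheel. A structure is treated as a sequence of domains, and any value the sequence fails to match is flagged as a discrepancy:

```python
import re

# Hypothetical domains, each defined by a regular expression (an assumption
# for illustration; the slides only say that custom domains can be defined).
DOMAINS = {
    "Int": r"[0-9]+",
    "Word": r"[A-Za-z]+",
    "Slash": r"/",
}

def matches_structure(value, structure):
    """Return True if the whole value matches the sequence of domains."""
    pattern = "".join(f"(?:{DOMAINS[d]})" for d in structure)
    return re.fullmatch(pattern, value) is not None

def find_discrepancies(values, structure):
    """Return the values that do not fit the structure."""
    return [v for v in values if not matches_structure(v, structure)]

dates = ["12/05/2001", "3/7/99", "May 5 2001", "11/11/2011"]
date_structure = ["Int", "Slash", "Int", "Slash", "Int"]
print(find_discrepancies(dates, date_structure))  # ['May 5 2001']
```

A parameterized structure could add a statistical check on top of this, for example flagging integers far from the column mean, but that is not shown here.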

Structures
• What makes a good structure :
• Recall (structure matches as many of the column values as possible)
• Precision (structure matches as few other values as possible; avoid overly broad structures)
• Conciseness (structure should have minimum length; avoid overfitting)
• How is a structure inferred :
• Minimum Description Length (MDL) metric.

MDL metric
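• Description length (DL) :
• Measures the space needed to describe a set of column values, given a structure.
• $DL(v, S) = (1 - f)\,\log|\xi^{\mathrm{len}(v)}| + p \log m + f \cdot (\text{space to express } v \text{ with } S)$
• Here $f$ is the fraction of values that match $S$, $p$ is the number of domains in $S$, $m$ is the number of known domains, and $\xi$ is the character alphabet.
• The three terms correspond to recall, conciseness, and precision, respectively.
• Structure inference algorithm :
• Enumerate fixed number of structures recursively.
• Use structure to compute description length (DL) measure for all values of a particular column.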
• Select structure with the lowest DL.
• Structure found, thus discrepancies found. What’s next?
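
The selection step can be sketched as follows. This is a simplified reading of the DL formula, not the authors' implementation: the domains, their alphabet sizes, the printable-ASCII alphabet of 95 characters, and the hand-written candidate list are all assumptions. Unmatched values are charged the cost of writing them out character by character, matched values are charged only for choosing a string within each domain, and each structure pays p log m for its own length.

```python
import math
import re

ALPHABET_SIZE = 95  # assumed size of the alphabet: printable ASCII

# Hypothetical domains: name -> (regex, number of distinct characters allowed).
DOMAINS = {
    "Int": (r"[0-9]+", 10),
    "Word": (r"[A-Za-z]+", 52),
    "Slash": (r"/", 1),
    "Any": (r".+", ALPHABET_SIZE),
}

def value_cost(value, structure):
    """Bits needed to encode one value, given the structure."""
    pattern = "".join(f"({DOMAINS[d][0]})" for d in structure)
    match = re.fullmatch(pattern, value)
    if match is None:
        # Unmatched values are written out explicitly over the full alphabet.
        return len(value) * math.log2(ALPHABET_SIZE)
    # Matched values only need to pick one string from each domain.
    return sum(len(part) * math.log2(DOMAINS[d][1])
               for part, d in zip(match.groups(), structure))

def description_length(values, structure):
    """Average DL per value: structure cost (p log m) plus encoding cost."""
    structure_cost = len(structure) * math.log2(len(DOMAINS))
    encoding_cost = sum(value_cost(v, structure) for v in values) / len(values)
    return structure_cost + encoding_cost

def best_structure(values, candidates):
    """Pick the candidate structure with the lowest description length."""
    return min(candidates, key=lambda s: description_length(values, s))

values = ["12/05/2001", "3/7/99", "11/11/2011"]
candidates = [
    ["Any"],
    ["Int", "Any"],
    ["Int", "Slash", "Int", "Slash", "Int"],
]
print(best_structure(values, candidates))
```

The overly broad structure ["Any"] is short but pays the full per-character cost for every value, while the five-domain date structure is longer yet encodes each value far more cheaply, so it has the lower DL.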

Interactive transformations
• GUI provided for simple transformations :
• Add, drop, copy, fold, etc.
• Undo supported.
• GUI not possible for complicated transforms :
• Splitting.
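
A minimal sketch of simple transforms with undo, under the assumption that transforms are kept as a log and replayed over the untouched source, so that undo just removes the last entry. The class TransformLog and the functions add, drop, and copy are hypothetical names; fold is omitted to keep the example short.

```python
def add(name, value):
    """Transform that appends a new column filled with a constant value."""
    def apply(columns, rows):
        return columns + [name], [row + [value] for row in rows]
    return apply

def drop(name):
    """Transform that removes a column."""
    def apply(columns, rows):
        i = columns.index(name)
        return columns[:i] + columns[i + 1:], [row[:i] + row[i + 1:] for row in rows]
    return apply

def copy(name, new_name):
    """Transform that appends a duplicate of a column under a new name."""
    def apply(columns, rows):
        i = columns.index(name)
        return columns + [new_name], [row + [row[i]] for row in rows]
    return apply

class TransformLog:
    """Keeps the source data unchanged and replays transforms on demand."""

    def __init__(self, columns, rows):
        self.source = (list(columns), [list(r) for r in rows])
        self.steps = []

    def apply(self, transform):
        self.steps.append(transform)

    def undo(self):
        # Undo removes the most recent transform from the log, if any.
        if self.steps:
            self.steps.pop()

    def view(self):
        columns, rows = self.source
        for step in self.steps:
            columns, rows = step(columns, rows)
        return columns, rows

log = TransformLog(["name", "year"], [["Ann", "2001"], ["Bob", "1999"]])
log.apply(copy("year", "year2"))
log.apply(drop("year"))
print(log.view())  # (['name', 'year2'], [['Ann', '2001'], ['Bob', '1999']])
log.undo()
print(log.view())  # (['name', 'year', 'year2'], [['Ann', '2001', '2001'], ['Bob', '1999', '1999']])
```

Replaying the log keeps undo trivial at the cost of recomputing the view; a snapshot-per-step design would trade memory for speed.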

Splitting
• Done by example.
• MDL metric used to infer structures.
• Once the structure is inferred, values are split using one of three strategies :
• Left to right
• Decreasing specificity
• Increasing specificity
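
As an illustration of the left-to-right strategy only, the sketch below matches each domain of an already inferred structure greedily from the current position, without backtracking. The domain set and the function name split_left_right are assumptions, and the two specificity-ordered strategies are not shown.

```python
import re

# Hypothetical domains expressed as regular expressions.
DOMAINS = {
    "Int": r"[0-9]+",
    "Word": r"[A-Za-z]+",
    "Comma": r",\s*",
    "Space": r"\s+",
}

def split_left_right(value, structure):
    """Split value into one substring per domain, scanning left to right.

    Returns None when the value does not fit the structure, which marks it
    as a candidate discrepancy rather than forcing a bad split.
    """
    parts, pos = [], 0
    for domain in structure:
        match = re.compile(DOMAINS[domain]).match(value, pos)
        if match is None:
            return None
        parts.append(match.group())
        pos = match.end()
    # The structure must account for the entire value.
    return parts if pos == len(value) else None

structure = ["Word", "Comma", "Word", "Space", "Int"]
print(split_left_right("Smith, John 42", structure))  # ['Smith', ', ', 'John', ' ', '42']
print(split_left_right("Smith John", structure))      # None
```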

Evaluation
• Structure inference algorithm works :
• Based on examples.
• Based on algorithm’s definition.
• Decreasing specificity was found to be the faster splitter :
• Specificity = sum (DL of example values, given S)
• Works best for splits involving many structures.
• Inferring structures superior to inferring regular expressions :
• Works on custom user-defined domains in a way that is robust to structural data errors.

Conclusion
• Potter’s wheel tool :
• Interactive
• Integrated
• Future work :
• Transforming nested data
• Complex transforms (e.g., Format via examples)
