Potter's Wheel : An Interactive
Data Cleaning System
(Raman and Hellerstein, Proc. VLDB, 2001)
Outline
• Problem
• Potter’s wheel
• Architecture
• Discrepancy detection
• Structures
• MDL metric
• Interactive Transformations
• Splitting
• Evaluation
• Conclusion
• Discussion
Problem
• Data cleaning is an important process.
• Cleaning involves data auditing and data
transformations.
• Current solutions (ETL tools and reengineering tools) :
• Iterative.
• Not interactive.
• Long wait times.
Potter’s wheel
• Integrates transformation and discrepancy
detection.
• Interactive transformations.
• Reduced wait times.
Architecture
• 4 parts :
• Data source (tabular, not nested)
• Online reorderer (spreadsheet, sorting, dynamic
display)
• Automatic discrepancy detector (runs in background)
• Transformation engine (applies transforms
immediately and in the background)
Discrepancy detection
• Performed in the background automatically.
• Done by finding suitable structures :
• Structure is a string of domains.
• Custom domains can be defined.
• Find records that do not fit the structure.
• Structures can be parameterized :
• Can use statistics to compute anomalies.
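
A minimal sketch of the idea above, assuming each domain can be approximated by a regular expression and a structure is a sequence of domain names and literal strings. The `DOMAINS` table and the function names are hypothetical and are not taken from the paper; adding an entry to `DOMAINS` stands in for defining a custom domain.

```python
import re

# Hypothetical domains: each name maps to a regular expression.
DOMAINS = {
    "Int": r"\d+",
    "Word": r"[A-Za-z]+",
}

def compile_structure(structure):
    """Build one regex from a sequence of domain names and literal strings."""
    pattern = "".join(DOMAINS.get(part) or re.escape(part) for part in structure)
    return re.compile(pattern)

def find_discrepancies(values, structure):
    """Return the values that do not fit the structure."""
    matcher = compile_structure(structure)
    return [v for v in values if not matcher.fullmatch(v)]

dates = ["Jan 12", "Feb 3", "12/03", "Mar 27"]
structure = ["Word", " ", "Int"]
print(find_discrepancies(dates, structure))  # ['12/03']
```

Only the value that fails to match the whole structure is reported, which mirrors the "find records that do not fit" step.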
Structures
• What makes a good structure :
• Recall (structure matches as many column values as
possible)
• Precision (structure matches as few other values as
possible; avoid overly broad structures)
• Conciseness (structure should have minimum length;
avoid overfitting)
• How is a structure inferred :
• Minimum Description Length (MDL) metric.
MDL metric
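• Description length (DL) :
• Measure used to describe a set of column values, given a
structure.
• $DL(v, S) = \underbrace{(1 - f)\,\log\lvert\xi^{\mathrm{len}(v)}\rvert}_{\text{recall}} + \underbrace{p \log m}_{\text{conciseness}} + \underbrace{f \cdot (\text{space to express } v \text{ with } S)}_{\text{precision}}$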
• Structure inference algorithm :
• Enumerate fixed number of structures recursively.
• Use structure to compute description length (DL) measure for all
values of a particular column.
• Select structure with the lowest DL.
• Structure found, thus discrepancies found. What’s next?
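
The selection step can be sketched as follows. This is a simplified illustration, not the paper's implementation: it reads f as the fraction of values that match the structure, p as the number of domains in the structure, m as the number of known domains, and ξ as the character alphabet. The per-domain alphabet sizes, `ALPHABET_SIZE`, and the fixed candidate list (used here in place of recursive enumeration) are all assumptions.

```python
import math
import re

# Hypothetical domains: name -> (regex, assumed alphabet size for encoding cost).
DOMAINS = {
    "Int": (r"\d+", 10),
    "Word": (r"[A-Za-z]+", 52),
    "Any": (r".+", 128),
}
ALPHABET_SIZE = 128  # assumed size of the full character alphabet

def compile_structure(structure):
    """Regex with one capture group per domain; other parts are literals."""
    parts = []
    for part in structure:
        if part in DOMAINS:
            parts.append(f"({DOMAINS[part][0]})")
        else:
            parts.append(re.escape(part))
    return re.compile("".join(parts))

def description_length(values, structure):
    """Average bits per value: unmatched cost + structure cost + matched cost."""
    pattern = compile_structure(structure)
    domains = [part for part in structure if part in DOMAINS]
    total_bits = 0.0
    for v in values:
        match = pattern.fullmatch(v)
        if match is None:
            # Values that do not fit are spelled out character by character.
            total_bits += len(v) * math.log2(ALPHABET_SIZE)
        else:
            # Values that fit are encoded component by component within each domain.
            for name, piece in zip(domains, match.groups()):
                total_bits += len(piece) * math.log2(DOMAINS[name][1])
    structure_bits = len(domains) * math.log2(len(DOMAINS))  # p log m
    return structure_bits + total_bits / len(values)

def best_structure(values, candidates):
    """Pick the candidate structure with the lowest description length."""
    return min(candidates, key=lambda s: description_length(values, s))

values = ["Jan 12", "Feb 3", "Mar 27", "12/03"]
candidates = [["Any"], ["Word", " ", "Int"], ["Int", "/", "Int"]]
print(best_structure(values, candidates))  # ['Word', ' ', 'Int']
```

The overly broad `Any` structure matches everything but pays a high per-character cost, while the `Int/Int` structure leaves most values unmatched, so the `Word Int` structure wins on this sample.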
Interactive transformations
• GUI provided for simple transformations :
• Add, drop, copy, fold, etc.
• Undo supported.
• GUI not possible for complicated transforms :
• Splitting.
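
A small sketch of how simple transforms and undo might fit together, assuming rows are Python lists and that fold turns one row into several rows by stacking the chosen columns into one column while replicating the rest. The `Session` class and its replay-based undo are illustrative assumptions, not the tool's actual design.

```python
def drop(col):
    """Transform that removes the column at index col."""
    return lambda rows: [r[:col] + r[col + 1:] for r in rows]

def copy(col):
    """Transform that appends a copy of column col to each row."""
    return lambda rows: [r + [r[col]] for r in rows]

def fold(cols):
    """Transform that emits one row per folded column, replicating the others."""
    def run(rows):
        out = []
        for r in rows:
            rest = [x for i, x in enumerate(r) if i not in cols]
            out.extend(rest + [r[c]] for c in cols)
        return out
    return run

class Session:
    """Keeps the source rows untouched and replays the transform log on demand."""

    def __init__(self, rows):
        self.source = rows
        self.log = []

    def apply(self, transform):
        self.log.append(transform)

    def undo(self):
        # Undo drops the most recent transform; the source is never modified.
        if self.log:
            self.log.pop()

    def view(self):
        rows = self.source
        for transform in self.log:
            rows = transform(rows)
        return rows

session = Session([["Ann", 90, 85], ["Bob", 70, 60]])
session.apply(fold([1, 2]))
print(session.view())  # [['Ann', 90], ['Ann', 85], ['Bob', 70], ['Bob', 60]]
session.apply(drop(0))
session.undo()         # back to the folded view
```

Because transforms are stored as a log rather than applied destructively, undo costs only a list pop followed by a replay.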
Splitting
• Done by example.
• MDL metric used to infer structures.
• Once structure is inferred, splitting follows :
• Left to right
• Decreasing Specificity
• Increasing Specificity
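
The left-to-right strategy can be sketched as a backtracking match: try each prefix of the value against the first part of the structure, then recurse on the remainder. This is an illustrative sketch only; the regex-based `DOMAINS` table and the function names are assumptions, and the two specificity-ordered strategies are not shown.

```python
import re

# Hypothetical domains; in practice these would come from structure inference.
DOMAINS = {"Int": r"\d+", "Word": r"[A-Za-z]+"}

def matches(part, text):
    """True if text fits the named domain, or equals the literal part."""
    if part in DOMAINS:
        return re.fullmatch(DOMAINS[part], text) is not None
    return part == text

def split_left_right(value, structure):
    """Split value into one piece per structure part, or return None if it does not fit."""
    if not structure:
        return [] if value == "" else None
    head, rest = structure[0], structure[1:]
    # Try every prefix for the first part, backtracking if the remainder fails.
    for end in range(1, len(value) + 1):
        if matches(head, value[:end]):
            tail = split_left_right(value[end:], rest)
            if tail is not None:
                return [value[:end]] + tail
    return None

print(split_left_right("Taylor 42", ["Word", " ", "Int"]))  # ['Taylor', ' ', '42']
print(split_left_right("42 Taylor", ["Word", " ", "Int"]))  # None
```

Many short prefixes are tried and rejected before the right one is found, which suggests why the order in which structure parts are matched affects splitting speed.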
Evaluation
• Structure inference algorithm works :
• Based on examples.
• Based on algorithm’s definition.
• Decreasing specificity was found to be the faster
splitter :
• Specificity = sum (DL of example values, given S)
• Works best for splits involving many structures.
• Inferring structures superior to inferring regular
expressions :
• Works on custom user-defined domains in a way that is
robust to structural data errors.
Conclusion
• Potter’s wheel tool :
• Interactive
• Integrated
• Future work :
• Transforming nested data
• Complex transforms (e.g., Format via examples)
Thank you
