SlideShare a Scribd company logo
1 of 50
A diamond is just a piece of charcoal with better structure
FOOFAH: Transforming Data by Example
“Mark” Zhongjun Jin, Michael R. Anderson, Michael Cafarella, H. V. Jagadish
University of Michigan 1
Unstructured Data is Unusable Until It is Transformed
Tel Fax
Niles C. (800)645-8397 (907)586-7252
Jean H. (918)781-4600 (918)781-4604
Frank K. (615)564-6500 (615)564-6701
…
Structured Contact Information
Bureau of I.A.
Regional Director Numbers
Niles C. Tel:(800)645-8397
Fax:(907)586-7252
Jean H. Tel:(918)781-4600
Fax:(918)781-4604
Frank K. Tel:(615)564-6500
Fax:(615)564-6701
…
Database
Unstructured Contact Information
2
Traditional Data Transformation is Expensive
Structured Contact Information
1. Use Excels
2. Write programs
Bureau of I.A.
Regional Director Numbers
Niles C. Tel:(800)645-8397
Fax:(907)586-7252
Jean H. Tel:(918)781-4600
Fax:(918)781-4604
Frank K. Tel:(615)564-6500
Fax:(615)564-6701
…
Bureau of I.A.
Regional Director Numbers
Niles C. Tel:(800)645-8397
Fax:(907)586-7252
Jean H. Tel:(918)781-4600
Fax:(918)781-4604
Frank K. Tel:(615)564-6500
Fax:(615)564-6701
…
Unstructured Contact Information
Tel Fax
Niles C. (800)645-8397 (907)586-7252
Jean H. (918)781-4600 (918)781-4604
Frank K. (615)564-6500 (615)564-6701
…
3
1. Use Excels
2. Write programs
3. ???
Traditional Data Transformation is Expensive
Structured Contact Information
Bureau of I.A.
Regional Director Numbers
Niles C. Tel:(800)645-8397
Fax:(907)586-7252
Jean H. Tel:(918)781-4600
Fax:(918)781-4604
Frank K. Tel:(615)564-6500
Fax:(615)564-6701
…
Bureau of I.A.
Regional Director Numbers
Niles C. Tel:(800)645-8397
Fax:(907)586-7252
Jean H. Tel:(918)781-4600
Fax:(918)781-4604
Frank K. Tel:(615)564-6500
Fax:(615)564-6701
…
Unstructured Contact Information
Tel Fax
Niles C. (800)645-8397 (907)586-7252
Jean H. (918)781-4600 (918)781-4604
Frank K. (615)564-6500 (615)564-6701
…
4
Data Wrangler
• “Wrangler: Interactive Visual Specification of Data Transformation
Scripts”, Sean Kandel, Andreas Paepcke, Joseph Hellerstein and Jeffrey
Heer, CHI 2011.
5
Programming-By-Demonstration(PBD) Model
• System learns the program by user demonstrating the actions
required to complete the task
Step 2: System Suggests
Some Operations Step 3: User Picks an Operation
Step 1: User Partially Demonstrates an Operation
6
No programming
skills required!
Possible to improve?
Usability Issues with Programming-By-
Demonstration Data Transformation
1. Repetitive and tedious
• User effort is proportional to program length
2. Requires good knowledge about data transformation
7
Our Proposal
• Data Transformation through Programming-By-Example (PBE)
• User provides input-output examples rather than selecting the correct
operations
• Previous data transformation PBE work by Gulwani et. al.
• FlashFill [Gulwani 2011], FlashRelate [Barowy et. al. 2015], ProgFromEx [Harris,
Gulwani 2011]
8
Outline
• Introduction
• Problem Definition
• Our Solution
• Program Synthesis as a Search Problem
• A New Heuristic: Table Edit Distance + batching
• Experiments
• Conclusion
9
Expected PBE Interaction Model
10
Foofah
Input-output
Example
Synthesized
Program
Raw Data
User Input
Input-output Example: ℰ= (ei, eo)
ℰ = Sampled raw data ei + Transformed view of the sample ei
ei eo
11
Machine Learning
Data
Visualization
Data Integration
Sample Input-output Example
Bureau of I.A.
Regional Director Numbers
Niles C. Tel: (800)645-8397
Fax: (907)586-7252
Jean H. Tel: (918)781-4600
Fax: (918)781-4604
Frank K. Tel: (615)564-6500
Fax: (615)564-6701
…
Raw data
Bureau of I.A.
Regional Director Numbers
Niles C. Tel: (800)645-8397
Fax: (907)586-7252
Jean H. Tel: (918)781-4600
Fax: (918)781-4604
Sampled raw data
Tel Fax
Niles C. (800)645-8397 (907)586-7252
Jean H. (918)781-4600 (918)781-4604
Transformed view of the sample
12
System Output
Potter’s Wheel program:
A sequence of parameterized Potter’s Wheel operators
[Raman, Hellerstein 2001]
• Potter’s Wheel program
• Operators 𝓕:
Drop, Move, Copy, Merge, Split, Fold, Unfold, Fill, Divide, Delete, Extract,
Transpose
𝓕1[p1, p2,…](x) 𝓕2[p1, p2,…](x)
13
Raw Data “Evolved” Data More “Evolved” Data
…maybe more
Problem to Solve
Synthesize a program 𝒫 that takes ei as the input and outputs eo
ei eo
Data Transformation Program
𝒫
?
14
Machine
Learning
Data
Visualization
Data
Integration
Raw Data
• A grid of values, i.e., spreadsheets
• “Somewhat” structured: must have some regular structure or is
automatically generated.
15
Health RecordsE-receiptSystem Log
AND MORE …
Desired Transformations
• Supported: Layout transformations and/or syntactic transformations
• Not Supported: Semantic transformations
• E.g., May 17 2017 → 05 17 2017
16
Example of a syntactic transformation
05-16-2017
05-17-2017
05-18-2017
…
05/16/2017
05/17/2017
05/18/2017
…
Example of a layout transformation
Outline
• Introduction
• Problem Definition
• Our Solution
• Program Synthesis as a Search Problem
• A New Heuristic: Table Edit Distance + batching
• Experiments
• Conclusion
17
18
Program Synthesis as a Search Problem
Input Example ei
Output Example eo
?
19
Program Synthesis as a Search Problem
Input Example ei
?
Output Example eo
20
Program Synthesis as a Search Problem
Input Example ei
? Output Example eo
21
Program Synthesis as a Search Problem
Input Example ei
Output Example eo
Search Space: All transformation programs
States: Different views of the data
Edges: Parameterized operators
A* search algorithm keeps exploring the most promising node -
smallest f(n)
• f(n) = g(n) + h(n)
A* Search
Observed Part: traveled distance
for the current state
Intermediate state ‘n’
22
Estimated Part: estimated
distance to the goal state
Applying A* Search Algorithm in Our Problem
• Cost function: # of Potter’s Wheel operators
23
Input Example ei
g(n): “# of Potter’s Wheel operators”
Output Example eo
Requirements for Heuristic
1. Easy to compute
2. Effective
• Prioritizes correct states, punishes incorrect states
3. Heuristic scale is not too large
• If heuristic score is too large, A* search could be misled by imperfect heuristic
f(n) = g(n) + h(n) h(n) ≫ g(n) f(n) ≈ h(n)
24
Possible Heuristics and Why They Are Bad
1. Operator-level heuristic
• Estimate # of Potter’s Wheel operations to use
25
Requirements for heuristic
1. Easy to compute ❌
2. Effective❓
3. Close to g(n) scale ✔
Estimating # of Potter’s Wheel operators
is hard. Details are in our paper.
Possible Heuristics and Why They Are Bad
1. Operator-level
• Estimate # of Potter’s Wheel operations to use
2. Column-level heuristic
• # of different columns
26
Requirements for heuristic
1. Easy to compute ✔
2. Effective ❌
3. Close to g(n) scale ❓
Operators like Fold, Unfold, Delete,
Transform, etc., move or remove cells
across multiple columns
Possible Heuristics and Why They Are Bad
1. Operator-level heuristic
• Estimate # of Potter’s Wheel operations to use
2. Column-level heuristic
• # of different columns
3. Cell-level heuristic
27
Requirements for heuristic
1. Easy to compute ✔
2. Effective ❓
3. Close to g(n) scale ❓
Intuition:
• A correct operation makes table more similar to its desired state
• The similar a table is to its desired state, the fewer cells need to be changed
Outline
• Introduction
• Problem Definition
• Our Solution
• Program Synthesis as a Search Problem
• A New Heuristic: Table Edit Distance + batching
• Experiments
• Conclusion
28
Most data transformation operations can be seen as many cell-level
transformation operations:
• Add
• Remove
• Move
• Transform
Observation about Data Transformation Ops
Cell-level operators
Move a cell
Alice Math A+
Transform a cell
Mike Anderson University of Michigan PhD Student
Alice A+ Math Mike Anderson University of Michigan PhD
Add a cell
Alice Math A+
Math
Alice A+
Remove a cell
Mike Anderson University of Michigan PhD Student
Mike Anderson University of Michigan
Example: Split operation
Tel: (800)645-8397
Fax: (907)586-7252
Tel (800)645-8397
Fax (907)586-7252
Split on ‘:’
Tel: (800)645-8397
Fax: (907)586-7252
Tel (800)645-8397
Fax (907)586-7252
Transform (0,0) to (0,0)
Potter’s Wheel Operator
Cell-level Operator
31
Transform (0,0) to (0,1)
Transform (1,0) to (1,0) Transform (1,0) to (1,1)
• Table Edit Distance Definition:
Distance 𝑇1, 𝑇2 = min
𝑝1,… , 𝑝 𝑘 ∈𝑃 𝑇1, 𝑇2
𝑖=0
𝑘
𝑐𝑜𝑠𝑡 𝑝𝑖
• P(T1, T2): Set of all “paths” transforming T1 to T2 using cell-level operators
• p1,…, pk: cell-level operators, including
Add, Remove, Move & Transform
Cell-level heuristic: Table Edit Distance
32
Current Cell-level Heuristic Is
33
1. Easy to compute ✔
2. Effective: Prioritizing correct states, punishing incorrect states ✔
3. Close to g(n) scale ❓
• Data transformation operators (and their cell-level operators)
transform geometrically-adjacent cells
Tel: (800)645-8397
Fax: (907)586-7252
Tel: (918)781-4600
Fax: (918)781-4604
Tel (800)645-8397
Fax
Tel
Fax
(907)586-7252
(918)781-4600
(918)781-4604
Observation about data transformation
operators
1 Split = 8 Transforms on 2 groups of Vertically adjacent cells
34
• Group geometrically-adjacent cell-level operations by their goemetric
patterns
Our Solution: Table Edit Distance Batch
Tel: (800)645-8397
Fax: (907)586-7252
Tel: (918)781-4600
Fax: (918)781-4604
Tel
Fax
Tel
Fax
Tel: (800)645-8397
Fax: (907)586-7252
Tel: (918)781-4600
Fax: (918)781-4604
(800)645-8397
(907)586-7252
(918)781-4600
(918)781-4604
batched Transform 1 batch Transform 2
35
Requirements for heuristic
1. Easy to compute ✔
2. Effective ✔
3. Close to g(n) scale ✔
Table Edit Distance + Batch has exponential
complexity. We use greedy algorithm. Detailed
algorithm is in our paper
Outline
• Introduction
• Problem Definition
• Our Solution
• Program Synthesis as a Search Problem
• A New Heuristic: Table Edit Distance + batching
• Experiments
• Conclusions
36
Claims
Our proposed PBE technique:
1. Can handle most data transformation tasks
2. Requires a little user effort
3. Is generally efficient (small system runtime)
37
Test Settings
• Prototype
• Foofah
• Benchmark Workload
• 50 test scenarios collected from related work: ProgFromEx [Harris, Gulwani
2011], Wrangler [Kandel et. al., 2011], Potter’s Wheel [Raman, Hellerstein
2001], ProactiveWrangler [Guo et. al. 2011]
• Baselines
• FlashRelate, ProgFromEx, Wrangler
• Test Environment
• a 16-core (2.53GHz) Intel Xeon E5630 machine with 120 GB RAM
38
Test Approach: lazy approach [Harris, Gulwani 2011]
39
raw data Foofah
raw data
Input Example + Output Example
synthesized
program
End
perfectly
transformed
one record
Foofah Handles Most Transformation Tasks
88.40%
97.70%
74.40%
55.80%
0%
20%
40%
60%
80%
100%
Pure layout transformation
Foofah FlashRelate
ProgFromEx Wrangler
40
Conclusion:
• Foofah can handle most of benchmark workload
• Ideally, Wrangler should handle as many tasks as our system
100.00%
0.00% 0.00%
85.70%
0%
20%
40%
60%
80%
100%
Include syntactic transformations
Foofah FlashRelate ProgFromEx Wrangler
Foofah Requires Few Inputs
41
Conclusion:
• Foofah mostly requires a small input-output example (input example contains no
more than 2 records) in our test workload
50.00%
40.00%
10.00%
0%
20%
40%
60%
80%
100%
1 2 Failure
Number of input-output examples
required for benchmark tests
Foofah is Efficient
42
Average Runtime: 1.4 SecondsConclusion:
• System runtime for each synthesis is short for most benchmark workload
74.00%
86.00% 88.00%
0%
20%
40%
60%
80%
100%
≤ 1 sec ≤ 5 sec ≤ 30 sec
Worst-case system runtime for each
synthesis
Foofah Saves More User Effort than Wrangler
43
0
100
200
300
400
500
600
TotalCompletionTime(S)
Wrangler FoofahConclusion
• Foofah on average requires about 60% less user effort than
Wrangler and never loses
Test Settings:
• 8 selected tasks covering both
complex and simple tasks
• 10 grad student participants
• 600 seconds limit for each task
Conclusion and Future Work
• The proposed PBE data transformation technique:
1. Consumes less user effort than state-of-the-art PBD data wrangling system
Data Wrangler
2. Solves more data transformation tasks than existing PBE data
transformation systems
• Next:
• Build interface allowing the user to incorporate new operators
• Add tolerance for inaccurate user-provided examples
44
Thanks to prior works and, of course, you!
1. Kandel, Sean, et al. “Wrangler: Interactive visual specification of data
transformation scripts.” CHI, 2011.
2. V. Raman and J. M. Hellerstein. “Potter’s Wheel: An interactive data cleaning
system”. VLDB, 2001.
3. Gulwani, Sumit. "Automating string processing in spreadsheets using input-
output examples." ACM SIGPLAN Notices. Vol. 46. No. 1. ACM, 2011.
4. Harris, William R., and Sumit Gulwani. "Spreadsheet table transformations
from examples." ACM SIGPLAN Notices. Vol. 46. No. 6. ACM, 2011.
5. Barowy, Daniel W., et al. "FlashRelate: extracting relational data from semi-
structured spreadsheets using examples." ACM SIGPLAN Notices. Vol. 50. No. 6.
ACM, 2015.
6. Guo, Philip J., et al. "Proactive wrangling: mixed-initiative end-user
programming of data transformation scripts." Proceedings of the 24th annual
ACM symposium on User interface software and technology. ACM, 2011.
45
Niles C. Tel (800)645-8397
Niles C. Fax (907)586-7252
Unfold columns2,
column3 on column3
Tel Fax
Niles C. (800)645-8397 (907)586-7252
Tel (800)645-8397
Fax (907)586-7252 (907)586-7252(800)645-8397
Tel Fax
Move (1,1) to (0,2)
Niles C.
Niles C. Niles C.
Potter’s Wheel Operator
Cell-level Operator Move (0,2) to (1,1)
Delete (0,0)
Demonstrating Example: Unfold operation
46
Table Edit Distance Batch
• Batch “geometric patterns”
1. Vertical-to-vertical, vertical-to-horizontal, horizontal-to-horizontal,
horizontal-to-vertical
2. one-to-horizontal, one-to-vertical
3. Horizontal-to-null, vertical-to-null
47
Experiments on Competing Algorithms
48
Pruning Rules
49
Experiments on Adding New Operators
50

More Related Content

Similar to Foofah: Data Transformation by Example (SIGMOD 2017)

Deep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudDataDeep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudDataWeCloudData
 
Neural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionNeural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionRrubaa Panchendrarajan
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...MLconf
 
SQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsSQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsJen Stirrup
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia VoulibasiISSEL
 
Efficient Query Processing Infrastructures
Efficient Query Processing InfrastructuresEfficient Query Processing Infrastructures
Efficient Query Processing InfrastructuresCrai Macdonald
 
論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion
論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion
論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph CompletionNaomi Shiraishi
 
Feature Engineering
Feature Engineering Feature Engineering
Feature Engineering odsc
 
Bob Garrett: Network of Networks Analysis
Bob Garrett: Network of Networks AnalysisBob Garrett: Network of Networks Analysis
Bob Garrett: Network of Networks AnalysisEnergyTech2015
 
Open Source North - MongoDB Advanced Schema Design Patterns
Open Source North - MongoDB Advanced Schema Design PatternsOpen Source North - MongoDB Advanced Schema Design Patterns
Open Source North - MongoDB Advanced Schema Design PatternsMatthew Kalan
 
Dagobahic2020orange
Dagobahic2020orangeDagobahic2020orange
Dagobahic2020orangeJixiongLIU
 
Keepin’ It Real(-Time) With Nadine Farah | Current 2022
Keepin’ It Real(-Time) With Nadine Farah | Current 2022Keepin’ It Real(-Time) With Nadine Farah | Current 2022
Keepin’ It Real(-Time) With Nadine Farah | Current 2022HostedbyConfluent
 
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Thanh Tran
 

Similar to Foofah: Data Transformation by Example (SIGMOD 2017) (20)

Licentiate Defense Slide
Licentiate Defense SlideLicentiate Defense Slide
Licentiate Defense Slide
 
Deep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudDataDeep Learning Introduction - WeCloudData
Deep Learning Introduction - WeCloudData
 
Neural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionNeural Architectures for Named Entity Recognition
Neural Architectures for Named Entity Recognition
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE
 
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
 
SQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsSQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and Statistics
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia Voulibasi
 
Efficient Query Processing Infrastructures
Efficient Query Processing InfrastructuresEfficient Query Processing Infrastructures
Efficient Query Processing Infrastructures
 
Searching Algorithms
Searching AlgorithmsSearching Algorithms
Searching Algorithms
 
論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion
論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion
論文紹介:Graph Pattern Entity Ranking Model for Knowledge Graph Completion
 
Feature Engineering
Feature Engineering Feature Engineering
Feature Engineering
 
Bob Garrett: Network of Networks Analysis
Bob Garrett: Network of Networks AnalysisBob Garrett: Network of Networks Analysis
Bob Garrett: Network of Networks Analysis
 
Open Source North - MongoDB Advanced Schema Design Patterns
Open Source North - MongoDB Advanced Schema Design PatternsOpen Source North - MongoDB Advanced Schema Design Patterns
Open Source North - MongoDB Advanced Schema Design Patterns
 
Presentation No.1
Presentation No.1Presentation No.1
Presentation No.1
 
ENAR short course
ENAR short courseENAR short course
ENAR short course
 
19.pptx
19.pptx19.pptx
19.pptx
 
Dagobahic2020orange
Dagobahic2020orangeDagobahic2020orange
Dagobahic2020orange
 
Keepin’ It Real(-Time) With Nadine Farah | Current 2022
Keepin’ It Real(-Time) With Nadine Farah | Current 2022Keepin’ It Real(-Time) With Nadine Farah | Current 2022
Keepin’ It Real(-Time) With Nadine Farah | Current 2022
 
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
 

Recently uploaded

一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理pyhepag
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理pyhepag
 
The Significance of Transliteration Enhancing
The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancingmohamed Elzalabany
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...Amil baba
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxStephen266013
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证dq9vz1isj
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp onlinebalibahu1313
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsBrainSell Technologies
 
ℂall Girls Kashmiri Gate ℂall Now Chhaya ☎ 9899900591 WhatsApp Number 24/7
ℂall Girls Kashmiri Gate ℂall Now Chhaya ☎ 9899900591 WhatsApp  Number 24/7ℂall Girls Kashmiri Gate ℂall Now Chhaya ☎ 9899900591 WhatsApp  Number 24/7
ℂall Girls Kashmiri Gate ℂall Now Chhaya ☎ 9899900591 WhatsApp Number 24/7komalsharmaa480
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group MeetingAlison Pitt
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyRafigAliyev2
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一0uyfyq0q4
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxStephen266013
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理pyhepag
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdfvyankatesh1
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxDilipVasan
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一fztigerwe
 
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一w7jl3eyno
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理cyebo
 

Recently uploaded (20)

一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 
The Significance of Transliteration Enhancing
The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancing
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp online
 
123.docx. .
123.docx.                                 .123.docx.                                 .
123.docx. .
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
ℂall Girls Kashmiri Gate ℂall Now Chhaya ☎ 9899900591 WhatsApp Number 24/7
ℂall Girls Kashmiri Gate ℂall Now Chhaya ☎ 9899900591 WhatsApp  Number 24/7ℂall Girls Kashmiri Gate ℂall Now Chhaya ☎ 9899900591 WhatsApp  Number 24/7
ℂall Girls Kashmiri Gate ℂall Now Chhaya ☎ 9899900591 WhatsApp Number 24/7
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdf
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
 
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
 

Foofah: Data Transformation by Example (SIGMOD 2017)

  • 1. A diamond is just a piece of charcoal with better structure FOOFAH: Transforming Data by Example “Mark” Zhongjun Jin, Michael R. Anderson, Michael Cafarella, H. V. Jagadish University of Michigan 1
  • 2. Unstructured Data is Unusable Until It is Transformed Tel Fax Niles C. (800)645-8397 (907)586-7252 Jean H. (918)781-4600 (918)781-4604 Frank K. (615)564-6500 (615)564-6701 … Structured Contact Information Bureau of I.A. Regional Director Numbers Niles C. Tel:(800)645-8397 Fax:(907)586-7252 Jean H. Tel:(918)781-4600 Fax:(918)781-4604 Frank K. Tel:(615)564-6500 Fax:(615)564-6701 … Database Unstructured Contact Information 2
  • 3. Traditional Data Transformation is Expensive Structured Contact Information 1. Use Excels 2. Write programs Bureau of I.A. Regional Director Numbers Niles C. Tel:(800)645-8397 Fax:(907)586-7252 Jean H. Tel:(918)781-4600 Fax:(918)781-4604 Frank K. Tel:(615)564-6500 Fax:(615)564-6701 … Bureau of I.A. Regional Director Numbers Niles C. Tel:(800)645-8397 Fax:(907)586-7252 Jean H. Tel:(918)781-4600 Fax:(918)781-4604 Frank K. Tel:(615)564-6500 Fax:(615)564-6701 … Unstructured Contact Information Tel Fax Niles C. (800)645-8397 (907)586-7252 Jean H. (918)781-4600 (918)781-4604 Frank K. (615)564-6500 (615)564-6701 … 3
  • 4. 1. Use Excels 2. Write programs 3. ??? Traditional Data Transformation is Expensive Structured Contact Information Bureau of I.A. Regional Director Numbers Niles C. Tel:(800)645-8397 Fax:(907)586-7252 Jean H. Tel:(918)781-4600 Fax:(918)781-4604 Frank K. Tel:(615)564-6500 Fax:(615)564-6701 … Bureau of I.A. Regional Director Numbers Niles C. Tel:(800)645-8397 Fax:(907)586-7252 Jean H. Tel:(918)781-4600 Fax:(918)781-4604 Frank K. Tel:(615)564-6500 Fax:(615)564-6701 … Unstructured Contact Information Tel Fax Niles C. (800)645-8397 (907)586-7252 Jean H. (918)781-4600 (918)781-4604 Frank K. (615)564-6500 (615)564-6701 … 4
  • 5. Data Wrangler • “Wrangler: Interactive Visual Specification of Data Transformation Scripts”, Sean Kandel, Andreas Paepcke, Joseph Hellerstein and Jeffrey Heer, CHI 2011. 5
  • 6. Programming-By-Demonstration(PBD) Model • System learns the program by user demonstrating the actions required to complete the task Step 2: System Suggests Some Operations Step 3: User Picks an Operation Step 1: User Partially Demonstrates an Operation 6 No programming skills required! Possible to improve?
  • 7. Usability Issues with Programming-By- Demonstration Data Transformation 1. Repetitive and tedious • User effort is proportional to program length 2. Requires good knowledge about data transformation 7
  • 8. Our Proposal • Data Transformation through Programming-By-Example (PBE) • User provides input-output examples rather than selecting the correct operations • Previous data transformation PBE work by Gulwani et. al. • FlashFill [Gulwani 2011], FlashRelate [Barowy et. al. 2015], ProgFromEx [Harris, Gulwani 2011] 8
  • 9. Outline • Introduction • Problem Definition • Our Solution • Program Synthesis as a Search Problem • A New Heuristic: Table Edit Distance + batching • Experiments • Conclusion 9
  • 10. Expected PBE Interaction Model 10 Foofah Input-output Example Synthesized Program Raw Data
  • 11. User Input Input-output Example: ℰ= (ei, eo) ℰ = Sampled raw data ei + Transformed view of the sample ei ei eo 11 Machine Learning Data Visualization Data Integration
  • 12. Sample Input-output Example Bureau of I.A. Regional Director Numbers Niles C. Tel: (800)645-8397 Fax: (907)586-7252 Jean H. Tel: (918)781-4600 Fax: (918)781-4604 Frank K. Tel: (615)564-6500 Fax: (615)564-6701 … Raw data Bureau of I.A. Regional Director Numbers Niles C. Tel: (800)645-8397 Fax: (907)586-7252 Jean H. Tel: (918)781-4600 Fax: (918)781-4604 Sampled raw data Tel Fax Niles C. (800)645-8397 (907)586-7252 Jean H. (918)781-4600 (918)781-4604 Transformed view of the sample 12
  • 13. System Output Potter’s Wheel program: A sequence of parameterized Potter’s Wheel operators [Raman, Hellerstein 2001] • Potter’s Wheel program • Operators 𝓕: Drop, Move, Copy, Merge, Split, Fold, Unfold, Fill, Divide, Delete, Extract, Transpose 𝓕1[p1, p2,…](x) 𝓕2[p1, p2,…](x) 13 Raw Data “Evolved” Data More “Evolved” Data …maybe more
  • 14. Problem to Solve Synthesize a program 𝒫 that takes ei as the input and outputs eo ei eo Data Transformation Program 𝒫 ? 14 Machine Learning Data Visualization Data Integration
  • 15. Raw Data • A grid of values, i.e., spreadsheets • “Somewhat” structured: must have some regular structure or is automatically generated. 15 Health RecordsE-receiptSystem Log AND MORE …
  • 16. Desired Transformations • Supported: Layout transformations and/or syntactic transformations • Not Supported: Semantic transformations • E.g., May 17 2017 → 05 17 2017 16 Example of a syntactic transformation 05-16-2017 05-17-2017 05-18-2017 … 05/16/2017 05/17/2017 05/18/2017 … Example of a layout transformation
  • 17. Outline • Introduction • Problem Definition • Our Solution • Program Synthesis as a Search Problem • A New Heuristic: Table Edit Distance + batching • Experiments • Conclusion 17
  • 18. 18 Program Synthesis as a Search Problem Input Example ei Output Example eo ?
  • 19. 19 Program Synthesis as a Search Problem Input Example ei ? Output Example eo
  • 20. 20 Program Synthesis as a Search Problem Input Example ei ? Output Example eo
  • 21. 21 Program Synthesis as a Search Problem Input Example ei Output Example eo Search Space: All transformation programs States: Different views of the data Edges: Parameterized operators
  • 22. A* search algorithm keeps exploring the most promising node - smallest f(n) • f(n) = g(n) + h(n) A* Search Observed Part: traveled distance for the current state Intermediate state ‘n’ 22 Estimated Part: estimated distance to the goal state
  • 23. Applying A* Search Algorithm in Our Problem • Cost function: # of Potter’s Wheel operators 23 Input Example ei g(n): “# of Potter’s Wheel operators” Output Example eo
  • 24. Requirements for Heuristic 1. Easy to compute 2. Effective • Prioritizes correct states, punishes incorrect states 3. Heuristic scale is not too large • If heuristic score is too large, A* search could be misled by imperfect heuristic f(n) = g(n) + h(n) h(n) ≫ g(n) f(n) ≈ h(n) 24
  • 25. Possible Heuristics and Why They Are Bad 1. Operator-level heuristic • Estimate # of Potter’s Wheel operations to use 25 Requirements for heuristic 1. Easy to compute ❌ 2. Effective❓ 3. Close to g(n) scale ✔ Estimating # of Potter’s Wheel operators is hard. Details are in our paper.
  • 26. Possible Heuristics and Why They Are Bad 1. Operator-level • Estimate # of Potter’s Wheel operations to use 2. Column-level heuristic • # of different columns 26 Requirements for heuristic 1. Easy to compute ✔ 2. Effective ❌ 3. Close to g(n) scale ❓ Operators like Fold, Unfold, Delete, Transform, etc., move or remove cells across multiple columns
  • 27. Possible Heuristics and Why They Are Bad 1. Operator-level heuristic • Estimate # of Potter’s Wheel operations to use 2. Column-level heuristic • # of different columns 3. Cell-level heuristic 27 Requirements for heuristic 1. Easy to compute ✔ 2. Effective ❓ 3. Close to g(n) scale ❓ Intuition: • A correct operation makes table more similar to its desired state • The similar a table is to its desired state, the fewer cells need to be changed
  • 28. Outline • Introduction • Problem Definition • Our Solution • Program Synthesis as a Search Problem • A New Heuristic: Table Edit Distance + batching • Experiments • Conclusion 28
  • 29. Most data transformation operations can be seen as many cell-level transformation operations: • Add • Remove • Move • Transform Observation about Data Transformation Ops
  • 30. Cell-level operators Move a cell Alice Math A+ Transform a cell Mike Anderson University of Michigan PhD Student Alice A+ Math Mike Anderson University of Michigan PhD Add a cell Alice Math A+ Math Alice A+ Remove a cell Mike Anderson University of Michigan PhD Student Mike Anderson University of Michigan
  • 31. Example: Split operation Tel: (800)645-8397 Fax: (907)586-7252 Tel (800)645-8397 Fax (907)586-7252 Split on ‘:’ Tel: (800)645-8397 Fax: (907)586-7252 Tel (800)645-8397 Fax (907)586-7252 Transform (0,0) to (0,0) Potter’s Wheel Operator Cell-level Operator 31 Transform (0,0) to (0,1) Transform (1,0) to (1,0) Transform (1,0) to (1,1)
  • 32. • Table Edit Distance Definition: Distance 𝑇1, 𝑇2 = min 𝑝1,… , 𝑝 𝑘 ∈𝑃 𝑇1, 𝑇2 𝑖=0 𝑘 𝑐𝑜𝑠𝑡 𝑝𝑖 • P(T1, T2): Set of all “paths” transforming T1 to T2 using cell-level operators • p1,…, pk: cell-level operators, including Add, Remove, Move & Transform Cell-level heuristic: Table Edit Distance 32
  • 33. Current Cell-level Heuristic Is 33 1. Easy to compute ✔ 2. Effective: Prioritizing correct states, punishing incorrect states ✔ 3. Close to g(n) scale ❓
  • 34. • Data transformation operators (and their cell-level operators) transform geometrically-adjacent cells Tel: (800)645-8397 Fax: (907)586-7252 Tel: (918)781-4600 Fax: (918)781-4604 Tel (800)645-8397 Fax Tel Fax (907)586-7252 (918)781-4600 (918)781-4604 Observation about data transformation operators 1 Split = 8 Transforms on 2 groups of Vertically adjacent cells 34
  • 35. • Group geometrically-adjacent cell-level operations by their goemetric patterns Our Solution: Table Edit Distance Batch Tel: (800)645-8397 Fax: (907)586-7252 Tel: (918)781-4600 Fax: (918)781-4604 Tel Fax Tel Fax Tel: (800)645-8397 Fax: (907)586-7252 Tel: (918)781-4600 Fax: (918)781-4604 (800)645-8397 (907)586-7252 (918)781-4600 (918)781-4604 batched Transform 1 batch Transform 2 35 Requirements for heuristic 1. Easy to compute ✔ 2. Effective ✔ 3. Close to g(n) scale ✔ Table Edit Distance + Batch has exponential complexity. We use greedy algorithm. Detailed algorithm is in our paper
  • 36. Outline • Introduction • Problem Definition • Our Solution • Program Synthesis as a Search Problem • A New Heuristic: Table Edit Distance + batching • Experiments • Conclusions 36
  • 37. Claims Our proposed PBE technique: 1. Can handle most data transformation tasks 2. Requires a little user effort 3. Is generally efficient (small system runtime) 37
  • 38. Test Settings • Prototype • Foofah • Benchmark Workload • 50 test scenarios collected from related work: ProgFromEx [Harris, Gulwani 2011], Wrangler [Kandel et. al., 2011], Potter’s Wheel [Raman, Hellerstein 2001], ProactiveWrangler [Guo et. al. 2011] • Baselines • FlashRelate, ProgFromEx, Wrangler • Test Environment • a 16-core (2.53GHz) Intel Xeon E5630 machine with 120 GB RAM 38
  • 39. Test Approach: lazy approach [Harris, Gulwani 2011] 39 raw data Foofah raw data Input Example + Output Example synthesized program End perfectly transformed one record
  • 40. Foofah Handles Most Transformation Tasks 88.40% 97.70% 74.40% 55.80% 0% 20% 40% 60% 80% 100% Pure layout transformation Foofah FlashRelate ProgFromEx Wrangler 40 Conclusion: • Foofah can handle most of benchmark workload • Ideally, Wrangler should handle as many tasks as our system 100.00% 0.00% 0.00% 85.70% 0% 20% 40% 60% 80% 100% Include syntactic transformations Foofah FlashRelate ProgFromEx Wrangler
  • 41. Foofah Requires Few Inputs 41 Conclusion: • Foofah mostly requires a small input-output example (input example contains no more than 2 records) in our test workload 50.00% 40.00% 10.00% 0% 20% 40% 60% 80% 100% 1 2 Failure Number of input-output examples required for benchmark tests
  • 42. Foofah is Efficient 42 Average Runtime: 1.4 SecondsConclusion: • System runtime for each synthesis is short for most benchmark workload 74.00% 86.00% 88.00% 0% 20% 40% 60% 80% 100% ≤ 1 sec ≤ 5 sec ≤ 30 sec Worst-case system runtime for each synthesis
  • 43. Foofah Saves More User Effort than Wrangler 43 0 100 200 300 400 500 600 TotalCompletionTime(S) Wrangler FoofahConclusion • Foofah on average requires about 60% less user effort than Wrangler and never loses Test Settings: • 8 selected tasks covering both complex and simple tasks • 10 grad student participants • 600 seconds limit for each task
  • 44. Conclusion and Future Work • The proposed PBE data transformation technique: 1. Consumes less user effort than state-of-the-art PBD data wrangling system Data Wrangler 2. Solves more data transformation tasks than existing PBE data transformation systems • Next: • Build interface allowing the user to incorporate new operators • Add tolerance for inaccurate user-provided examples 44
  • 45. Thanks to prior works and, of course, you! 1. Kandel, Sean, et al. “Wrangler: Interactive visual specification of data transformation scripts.” CHI, 2011. 2. V. Raman and J. M. Hellerstein. “Potter’s Wheel: An interactive data cleaning system”. VLDB, 2001. 3. Gulwani, Sumit. "Automating string processing in spreadsheets using input- output examples." ACM SIGPLAN Notices. Vol. 46. No. 1. ACM, 2011. 4. Harris, William R., and Sumit Gulwani. "Spreadsheet table transformations from examples." ACM SIGPLAN Notices. Vol. 46. No. 6. ACM, 2011. 5. Barowy, Daniel W., et al. "FlashRelate: extracting relational data from semi- structured spreadsheets using examples." ACM SIGPLAN Notices. Vol. 50. No. 6. ACM, 2015. 6. Guo, Philip J., et al. "Proactive wrangling: mixed-initiative end-user programming of data transformation scripts." Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 2011. 45
  • 46. Niles C. Tel (800)645-8397 Niles C. Fax (907)586-7252 Unfold columns2, column3 on column3 Tel Fax Niles C. (800)645-8397 (907)586-7252 Tel (800)645-8397 Fax (907)586-7252 (907)586-7252(800)645-8397 Tel Fax Move (1,1) to (0,2) Niles C. Niles C. Niles C. Potter’s Wheel Operator Cell-level Operator Move (0,2) to (1,1) Delete (0,0) Demonstrating Example: Unfold operation 46
  • 47. Table Edit Distance Batch • Batch “geometric patterns” 1. Vertical-to-vertical, vertical-to-horizontal, horizontal-to-horizontal, horizontal-to-vertical 2. one-to-horizontal, one-to-vertical 3. Horizontal-to-null, vertical-to-null 47
  • 48. Experiments on Competing Algorithms 48
  • 50. Experiments on Adding New Operators 50

Editor's Notes

  1. Good morning, everyone. Thanks for coming to my talk. I am Mark from University of Michigan and we have three other incredible researchers: Mike Anderson, Mike Cafarella, and H. V. Jagadish.
  2. As we all know that most data in the real-world are unstructured data. Performing data analysis, visualization or simply load them to the database is almost impossible, unless they are transformed into a better form, e.g. a relational form
  3. I guess most people will bet their luck on editing tools like Excel to perform such a transformation. But the amount of the effort quickly becomes intolerable as long as the data is larger than just a few records. Normally, programmers are hired to write programs to automate such transformation process. Yet, these code are hard to maintain and programmers today are very expensive.
  4. Then is there any cheaper solution to data transformation?
  5. Thanks to Prof Joe Hellerstein and his collbaorators from berkeley and stanford, we got Data Wrangler, provides an easy-to-use visualized data transformation framework. Data Wrangler project is very successful and was turned into a San-Francisco based start-up Trifacta. Their product are used by tens of thousands of real users across the world, and was recently integrated as a data preparation service in google cloud platform. For simplicity, we will just focus on Data Wrangler prototype today.
  6. Data Wrangler follows a programming-by-demonstration paradigm to synthesize the data transformation programs: 1. User first select a region in the data table where she wants to perform some operations. It could a column, a row, or a cell. 2. The data wrangler system then suggests a set of data transformation operations applicable for this selected region. 3. The user picks one operation, and the system applies this selected operation on the current data and displays the transformed view on the data table. And the user goes through rounds of this process until she finally transform the data into the desired format. This seems to be easier than writing programs. But this might still be a lots of hassle for average users for the following reasons
  7. As opposed to programming by demonstration paradigm, we believe a programming by example paradigm is superior in reducing the user effort. Because all the system needs to know is user’s description about the input and output data. Therefore, in our project, we decided to follow pbe approach to synthesize a data transformation program. Note that we are not the first people who tried programming by example for data transformation. Dr. Gulwani from MSR has done many important work on this area. But we observe that most of their work are very restricted to specific types of data transformation. We will show this in our experiments.
  8. Let’s first clarify what is the problem that we are solving in our project
  9. Here is our proposed user interaction model.
  10. The user input to our proposed synthesis technique is the input-output example. The input example is a sample of the raw data and the output example is the transformed view of this sample.
  11. Bob’s case
  12. The system output is a loop-free program parameterized by PW operators. A loop-free program is a straight-line program parametrized by the set of operators/components used. The second operator takes in the output of the first operator and so and and so forth. No outer-loops or conditions exist in the program.
  13. Finally, the problem to solve in our project is to find such a loop-free program that takes in user-provided sampled raw data and outputs the transformed view of the sample.
  14. The raw data we want to transform in our project is a grid of values, an example could be spreadsheets. The data has to follow some regular structures which means these data are usually automatically generated. Some examples could be system log data (txt data that are in csv format, where splitter is space or \t are taken in as tabular form). A counter example could be twitter posts, news articles.
  15. At this moment, we do not consider semantic transformations. A simple example of semantic transformation is transforming ”May” into “05”. Such transformation requires outside information which our system currently does not incorporate
  16. From our view, the process of synthesized program is similar to a search problem.
  17. When we picture our problem in this way: this is the input example provided by the user as an initial state, this is the output example as a goal state, the sequence of operations is like a path that connects input and output example.
  18. A* search algorithm is an approach that is commonly used to solve the search problem. A* keeps exploring the most promising intermediate state, which is the state with the smallest f score, until it reaches the goal state. F score is the sum of traveled distance and heuristic, which is the estimated distance to the goal state.
  19. We define table edit distance as the heuristic function. But in practice, it is not a perfect heuristic because it is much larger than potter’s wheel distance because it is proportional to the data size. And when this is the case. A* algorithm is simply a greedy algorithm and can be mislead by the heuristic severely. To alleviate this issue, we create a fix based on another observation.
  20. With these requirements in our mind let’s brainstorm some of the possible heuristic design. Intuitively, we might want to have an operator-level heuristic which estimates the actual number of potter’s wheel operators to use. But this is hard unless we already know what operator to use. Therefore, it is not easy to compute and is unclear whether it is effective.
  21. Since estimating the # of operators is hard, let’s try some simple heuristic, like a column-level heuristic. For example, we can estimate # of different columns in two tables. Although, this is easy to compute, it is essential very inaccurate because many operators transform across multiple columns.
  22. In our project, we decided to use a cell-level heuristic based on our heuristic that #1 a correct transformation makes data more similar to its desired state, #2 the similar a table is to its desired state.
  23. Add the naïve heuristic section
  24. We made an observation that almost all of the Potter’s Wheel operations can be seen as many cell-level transformation operation. Split for example can be seen as cell transform operations
  25. We made an observation that almost all of the Potter’s Wheel operations can be seen as many cell-level transformation operation. Split for example can be seen as cell transform operations
  26. We made an observation that almost all of the Potter’s Wheel operations can be seen as many cell-level transformation operation. Split for example can be seen as cell transform operations
  27. Now let’s be clear about the cell-level distance metric we are discussing here. To transform a table into another form, there might be multiple ways using cell-level operations. The distance metric is the cheapest way among all possible approches. The cell-level operations are add, remove, move and. transform.
  28. Graph edit distance is known to be np-complete. Therefore we choose to approximate our table edit distance using greedy method. We simply choose the cheapest operation to formulate each cell in the output table sequentially and remove the remaining cells from the input table. The distance is the sum of cost of the all operations.
  29. We observe that potter’s wheel operators transform geometrically adjacent cells. And the corresponding cell-level transformations are also transforming geometrically-adjacent cells.
  30. In our experiment we want to demonstrate that our proposed pbe data transformation technique can solve many real world data transformation tasks, and it reduces the user effort.
  31. We implemented our proposed technique in a prototype called foofah. We collected benchmarking test scenarios from several related work. Our selection criteria is the transformation has to generate a relational tabular data. In total 50 test scenarios are selected. The testing environment is as follows.
  32. We first want to evaluate the performance of foofah in all test scenarios. We follow a lazy approach that is used by other related PBE work. We pick one record and create the transformed view of this record and send them as an input-output example to our system. If the synthesized program is not good enough, we use this input example and another record from the raw data that the synthesized program fails to handle to formulate a new input example. We then create the transformed view of this new example and send it together with the new input example as input to our system and test again. We interact with foofah in this manner until we finally get a high quality program.
  33. roughly tie with other system on layout transformations and better than other in synatactic transformation
  34. The results show that Foofah is able to find perfect programs for 90% of the test scenarios using 1 or 2 records. In the rest 5 test scenarios. 4 of them are due of the lack of appropriate operators. We also measured the responsiveness. The graph shows the average and worst response time in each round among all test scenarios. 74% within 5 seconds and 86% within 10 seconds. Average is 1.4 seconds
  35. The result shows that foofah on average reduce 60% user effort. And more in those that require complex operators. Due to the differences in the interaction model. Wrangler needs more mouse clicks and foofah needs more key strokes.
  36. A more complex example is unfold, which can be seen as moving two cells and deleting a cell.