Maximizing Correctness with Minimal User Effort
to Learn Data Transformations
Bo Wu and Craig Knoblock
Department of Computer Science
University of Southern California
Art website Buyer
Dimension of artworks
Programming by Example
Video is from Excel YouTube official channel (https://www.youtube.com/watch?v=YPG8PAQQ894)
Too Many Records
Overconfident Users
Users are often too confident to examine the results thoroughly
Variations
Problem
Enable users of programming-by-example (PBE) systems to achieve maximal correctness with minimal effort on large datasets
Help users identify at least one incorrect record in every iteration, with minimal effort, on large datasets
Approach Overview
Entire dataset
Raw | Transformed
10“ H x 8” W | 10
H: 58 x W:25” | 58
12”H x 9”W | 12
11”H x 6” | 11
… | …
30 x 46” | 30 x 46
Sampled records (random sampling)
Raw | Transformed
10“ H x 8” W | 10
11”H x 6” | 11
… | …
30 x 46” | 30 x 46
Verifying records
Raw | Transformed
11”H x 6” | 11
30 x 46” | 30 x 46
… | …
Sorting and color-coding
Raw | Transformed
30 x 46” | 30 x 46
11”H x 6” | 11
… | …
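The first stage of this flow (random sampling, then applying the learned program) can be sketched in a few lines of Python. This is an illustration under assumptions: `transform` is a hypothetical stand-in for a learned program, not the program the system synthesizes, and the sample size and seed are arbitrary.

```python
import random
import re

QUOTES = '“”"'

def transform(raw):
    """Hypothetical stand-in for a learned program that extracts the height."""
    if "H" in raw:
        # Records that mention H: take the first number in the string.
        return re.search(r"\d+(?:\.\d+)?", raw).group(0)
    # No H marker: only the quote marks are stripped, so the result is wrong.
    return raw.translate(str.maketrans("", "", QUOTES)).strip()

def sample_records(records, k, seed=0):
    """Random sampling keeps the number of records to examine manageable."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

dataset = ["10“ H x 8” W", "H: 58 x W:25”", "12”H x 9”W", "11”H x 6”", "30 x 46”"]
sampled = sample_records(dataset, k=3)
transformed = {raw: transform(raw) for raw in sampled}
```

With this stand-in, the record without an H marker is left essentially unparsed, which is the kind of incorrect result the later stages are meant to surface.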
Learning from users’ feedback
Verifying Records
• First, recommend records that cause runtime errors
– Records that cause the program to exit abnormally
Ex: Input: 2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic
• Second, recommend potentially incorrect records
– Learn a binary meta-classifier
Raw | Transformed
11”H x 6” | 11
30 x 46” | 30 x 46
… | …
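The two-step recommendation on this slide can be sketched as follows, assuming the learned program is a Python callable that raises an exception when it hits a runtime error. The `extract_mileage` program and the `looks_incorrect` check are hypothetical placeholders; the slides do not specify either one.

```python
import re

def recommend(records, program, looks_incorrect):
    """Return records to review: runtime-error records first, then suspicious ones."""
    failed, suspicious = [], []
    for raw in records:
        try:
            result = program(raw)
        except Exception:
            # The program exited abnormally on this record.
            failed.append(raw)
            continue
        if looks_incorrect(raw, result):
            suspicious.append(raw)
    return failed + suspicious

def extract_mileage(raw):
    """Hypothetical learned program; raises AttributeError when no mileage is present."""
    return re.search(r"(\d+)k miles", raw).group(1)

def looks_incorrect(raw, result):
    """Hypothetical placeholder for the binary meta-classifier."""
    return not result.isdigit()

records = [
    "1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia)",
    "2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic",
]
to_review = recommend(records, extract_mileage, looks_incorrect)
```

Here the Mitsubishi record has no mileage, so the toy program fails on it and it is the only record recommended.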
Learning the Meta-classifier
Classifiers by type:
Similarity: cs1, cs2, cs3, cs4, …
Program agreement: cp1, cp2, cp3, cp4, …
Format ambiguity: cf1, cf2, cf3, cf4, …
Meta-classifier: combines selected classifiers (cs3, cs4, cp2, cf1, …) with weights (w1, w2, w3, w4, …)
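One way to read this slide is as a weighted vote over selected base classifiers. The sketch below assumes that reading: each base classifier returns +1 (looks incorrect) or -1 (looks correct), and the sign of the weighted sum is the prediction. The three toy classifiers and the weight values are invented for illustration; the slides give neither the features nor how the weights are learned.

```python
import re

def cs(raw, out):
    """Toy similarity classifier: an output longer than 3 characters is unusual."""
    return 1 if len(out) > 3 else -1

def cp(raw, out):
    """Toy program-agreement classifier: compare with an alternative program's output."""
    alternative = re.search(r"\d+", raw)
    return 1 if alternative is None or alternative.group(0) != out else -1

def cf(raw, out):
    """Toy format-ambiguity classifier: a purely numeric output is unambiguous."""
    return -1 if out.isdigit() else 1

# (weight, classifier) pairs; the weights are assumed values, not learned ones.
SELECTED = [(0.5, cs), (0.3, cp), (0.2, cf)]

def meta_score(raw, out, selected=SELECTED):
    """Weighted sum of base-classifier votes; higher means more likely incorrect."""
    return sum(w * clf(raw, out) for w, clf in selected)

def is_incorrect(raw, out, selected=SELECTED):
    """Binary decision: a positive weighted vote flags the record."""
    return meta_score(raw, out, selected) > 0
```

For example, `is_incorrect("30 x 46”", "30 x 46")` flags the record, while `is_incorrect("11”H x 6”", "11")` does not.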
Evaluation
• The recommendation contains incorrect records
• The recommendation can place incorrect records at the top
User study
Experiment setup:
• 5 scenarios with 4000 records per scenario
• 10 graduate students divided into two groups
Summary and Future Work
• Summary
– Sample records
– Identify incorrect/questionable records
– Allow users to refine the recommendation
– Color-code the results
• Future work
– Show histograms of the data
– Translate the program into readable natural-language text
Questions?
Data and system available at
https://github.com/areshand/Web-Karma
Types of Classifiers
• Classifier based on distance
• Classifier based on agreement of programs
• Classifier based on format ambiguity
Learning from various past results
Examples: incorrect records and correct records
Raw | Transformed
26" H x 24" W x 12.5 | 26
Framed at 21.75" H x 24.25” W | 21
12" H x 9" | 12
…
Raw | Transformed
Ravage 2099#24 (November, 1994) | November, 1994
Gambit III#1 (September, 1997) | September, 1997
(comic) Spidey Super Stories#12/2 (September, 1975) | comic
…
Sorting Records
Checking transformed records: runtime errors?
Yes: rank records using #failed_subprograms
No: rank records using meta-classifier output
Record | #failed_subprograms
2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic | 3
1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia) | 2
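The ranking on this slide can be read as a two-level sort. The sketch below assumes that reading: records with runtime errors come first, ranked by their number of failed subprograms, and the remaining records are ranked by meta-classifier output. The field names and the `meta_score` values are hypothetical; only the two failure counts come from the slide.

```python
def sort_for_review(records):
    """Order records so the most likely incorrect ones appear at the top."""
    def key(rec):
        has_error = rec["failed_subprograms"] > 0
        # Error records first; within each group, larger values rank higher.
        rank_value = rec["failed_subprograms"] if has_error else rec["meta_score"]
        return (not has_error, -rank_value)
    return sorted(records, key=key)

records = [
    {"raw": "11”H x 6”", "failed_subprograms": 0, "meta_score": -1.0},
    {"raw": "1998 Honda Civic 12k miles s. Auto. - $3800 (Arcadia)",
     "failed_subprograms": 2, "meta_score": 0.0},
    {"raw": "30 x 46”", "failed_subprograms": 0, "meta_score": 1.0},
    {"raw": "2008 Mitsubishi Galant ES $7500 (Sylmar CA) pic",
     "failed_subprograms": 3, "meta_score": 0.0},
]
ordered = [rec["raw"] for rec in sort_for_review(records)]
```

The example mixes records from two datasets only to exercise both branches in a single call.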


Editor's Notes

  1. Ashley wants to buy a painting for the space over her sofa. She has strict space limits. Ex: the painting should be about 60" wide and 40" high.
  2. Ashley got a spreadsheet of artworks on sale. The size information is a long list of entries, each with the height, width, and even depth in a single entry. She has to split them into three columns and remove extra text such as “H:”, “in.”, etc., so that she can filter the artworks by the size of each dimension. The dataset has so many records that she would have to write programs to solve the problem. Problem: learning that skill has a long learning curve. The time should be spent decorating her house instead.
  3. Programming by example means users no longer need to write code.
  4. The list can have thousands of records. It is really hard to notice records in the middle that are transformed incorrectly.
  5. According to previous research, users often believe that they have carefully examined all the records. They stop checking the results while a large percentage of the records in the dataset is still incorrect.
  6. To identify the Cannot rely on single rule or
  7. Random sampling addresses the too-many-records problem. Verifying records can capture incorrect records in various scenarios. Sorting and color-coding address the overconfident-user problem. The system can also learn from the user's interactions in the current iteration to refine the recommendation.
  8. Learn from the users' feedback to refine the recommendation.
  9. First, describe correctness. Second, iteration time. Third, total time. Explain why certain scenarios have longer total time. Why does beta have twice the iteration time of our approach in s5 and s3? Why does the iteration time in beta vary much more than in our approach?
  10. Summary vs Conclusion