Final VIPER presentation at BioVis 2013

BioVis 2013 Presentation of VIPER paper

J. Kennedy, M. Graham, T. Paterson, and A. Law, "Visual Cleaning of Genotype Data," Proc. 3rd IEEE Symposium on Biological Data Visualization, pp. 105-112, 2013, doi:10.1109/BioVis.2013.6664353.

The videos are missing, and the animations on the error inheritance slides are all messed up after slideshare conversion... but everything else is ok.

Published in: Data & Analytics

  1. Visual Cleaning of Genotype Data
     Jessie Kennedy, Martin Graham (Edinburgh Napier University)
     Trevor Paterson, Andy Law (The Roslin Institute, University of Edinburgh)
  2. Background
     • VIPER is a visualisation for spotting areas of error (impossible inheritance) in pedigree genotype datasets
     [Figure: pedigree structure with a genotype (e.g. G | T) shown for each individual at one marker; there are many more markers, with similar data per marker]
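To make "impossible inheritance" concrete, here is a minimal sketch of a check on one child and its two parents at a single marker. It is an illustration only, not VIPER's actual checking algorithm: the function names, the tuple representation of a genotype, and the use of `None` for untyped data are all assumptions.

```python
MISSING = None  # assumption: an untyped or masked genotype is stored as None

def can_give(parent, allele):
    """A parent can pass on an allele it carries; a missing parent could pass anything."""
    return parent is MISSING or allele in parent

def is_possible(child, sire, dam):
    """Return True if the child's genotype could be inherited from these parents.

    Genotypes are pairs of alleles such as ("G", "T"), or MISSING.
    One allele must come from the sire and the other from the dam.
    """
    if child is MISSING:
        return True  # nothing to contradict
    a, b = child
    return (can_give(sire, a) and can_give(dam, b)) or \
           (can_give(sire, b) and can_give(dam, a))

print(is_possible(("G", "T"), ("G", "G"), ("T", "A")))  # True
print(is_possible(("C", "C"), ("G", "G"), ("C", "A")))  # False: the sire has no C to give
print(is_possible(("C", "C"), MISSING, ("C", "A")))     # True: a missing sire is unconstrained
```

A check like this would run once per marker for every child in the pedigree, which is why the error counts can be aggregated across markers.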
  3. Background
     • The visualisation aggregated errors across markers and displayed them as offspring groups
       – Along with ancillary tables and bar charts
     • For it to be a useful biological tool, it needed to be extended to become a data cleaning application
  4. Background
     • Data Wrangling
       – Fixing unreliable or useless data
       – General Purpose vs Specific Task
     • General Purpose Tools
       – Wrangler / Google Refine
       – Tabular data
     • Ours is a Specific Task
       – Remove the errors as they break further analyses
       – Fixing errors often creates new ones, as our data is an inheritance graph of related data rather than a table
  5. Background
     • Error Visualisation Topics (in order of volume of work)
       – Uncertainty visualisation: show bounds of reliability
       – Missing data visualisation: is data present?
         • Usually the bane of visualisation rather than the aim
       – Correctness visualisation: is data right?
  6. Data Cleaning
     • We cover missing data and correctness. For us...
       – Incorrect data: bad.
       – Missing (incomplete) data: manageable.
     • Cleaning ≠ Correcting
       – Correction is preferable, but often impossible
     • We clean by deleting erroneous data points and inferring data from ancestor individuals
       – We swap wrong data for missing data
  7. Data Cleaning - Operations
     • Four basic masking operations
       1. Mask markers
       2. Mask individuals
       3. Mask single data points
       4. Break relationships
  8. Data Cleaning - Markers
     • Markers are independent of each other.
       – Masking one marker doesn’t change the errors in any other markers
     • Thus markers with lots of errors can be quickly removed with no side-effects
       – An early version of VIPER hid errors (but didn’t do anything to the underlying data)
  9. Data Cleaning - Individuals
     • Wanted to adopt the same approach...
       – But something odd happened.
       – Removing individuals changes the error counts of other individuals
         • Because individuals inherit from each other
     • So, e.g., removing every individual with > 5 errors produced other individuals with > 5 errors.
  10. Data Cleaning - Individuals
      • Some errors turned out to simply drop from one generation to the next
        – Literal “chase to the bottom”, lots of lost data
      • In these situations it is often necessary to break a child/parent relationship across all markers in the pedigree
        – Which is where the fourth masking operation originates
  11. Masking - 1
      [Figure: example pedigree with a genotype (e.g. A/G) shown for each individual]
      • Mask all errors
      • Recheck for errors
      • Repeat
      • Lose 50% of data
      www.napier.ac.uk/iidi
  12. Masking - 2
      [Figure: the same example pedigree]
      • Mask errors top down
      • Recheck for errors
      • Repeat
      • Lose 25% of data
  13. Masking - 3
      [Figure: the same example pedigree]
      • Mask errors top down + cut links
      • Recheck for errors
      • Repeat
      • Lose <20% of data
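The three masking strategies above can be sketched on a toy single-marker pedigree. This is a hedged illustration, not the checking algorithm used by VIPER or Genotype Checker. It assumes that an error is attributed to the child whose genotype cannot come from its parents, and that a masked genotype is inferred from the individual's own parents. All names (`inherited`, `possible`, `errors`, `depth`, `clean`, `lost`) and the example pedigree are hypothetical.

```python
from itertools import combinations_with_replacement

ANY = set(combinations_with_replacement("ACGT", 2))  # every sorted allele pair

def inherited(ind, parents, geno):
    """Genotypes `ind` could inherit, given what is known about its parents."""
    if ind not in parents:  # founder, or an individual whose parent links were cut
        return ANY
    sire, dam = parents[ind]
    return {tuple(sorted((a, b)))
            for gs in possible(sire, parents, geno)
            for gd in possible(dam, parents, geno)
            for a in gs for b in gd}

def possible(ind, parents, geno):
    """The observed genotype if there is one, otherwise whatever can be inferred."""
    g = geno.get(ind)
    return {g} if g is not None else inherited(ind, parents, geno)

def errors(parents, geno):
    """Individuals whose observed genotype cannot have come from their parents."""
    return {i for i, g in geno.items()
            if g is not None and g not in inherited(i, parents, geno)}

def depth(ind, parents):
    """Generation number: 0 for founders, otherwise one more than the deepest parent."""
    return 0 if ind not in parents else 1 + max(depth(p, parents) for p in parents[ind])

def clean(parents, geno, top_down=False):
    """Mask flagged genotypes, recheck, and repeat until nothing is flagged."""
    geno = dict(geno)
    while True:
        bad = errors(parents, geno)
        if not bad:
            return geno
        if top_down:  # only mask the errors in the earliest affected generation
            first = min(depth(i, parents) for i in bad)
            bad = {i for i in bad if depth(i, parents) == first}
        for i in bad:
            geno[i] = None

def lost(geno):
    """Number of masked (None) genotypes."""
    return sum(g is None for g in geno.values())

parents = {"P": ("GP1", "GP2"), "C": ("P", "M")}
base = {"GP1": ("G", "G"), "GP2": ("G", "G"), "P": ("A", "A"), "M": ("G", "G")}

# Case 1: P and C are both flagged, but C is only wrong relative to P's bad genotype.
g1 = dict(base, C=("G", "G"))
print(lost(clean(parents, g1)))                 # 2: P and C are both masked
print(lost(clean(parents, g1, top_down=True)))  # 1: masking P clears C

# Case 2: masking P makes the error drop to C; cutting P's parent links avoids any loss.
g2 = dict(base, C=("A", "G"))
print(lost(clean(parents, g2, top_down=True)))  # 2
cut = {"C": ("P", "M")}                         # P is now treated as a founder
print(errors(cut, g2))                          # set()
```

In this toy case, masking top down loses less data than masking every flagged genotype at once, and breaking a relationship can stop an error from dropping down a generation. The percentages on the slides come from the authors' own example pedigree, not from this sketch.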
  14. Showing Missing
      • Masked and missing data are shown in a different colour to error data
  15. Representations
      • Being careful not to use any other colours in the interface, we can see how cleaning is going (red vs blue)
      • New masking interactions available through standard context menus (and through tables)
  16. Visual History
      • With such a hypothetical / experimental method of cleaning errors, undo is a must
        – Part of Shneiderman’s mantra
        – Beyond single-step, branching history
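One way to picture a branching history is as a tree of cleaning states, where undo moves to the parent step and a new operation after an undo starts a sibling branch instead of discarding the old one. The sketch below is an assumption about how such a structure could look, not VIPER's implementation. The `Step` and `History` classes and the mask labels are hypothetical.

```python
class Step:
    """One node in the history tree: the masks in force after an operation."""
    def __init__(self, label, masked, parent=None):
        self.label = label
        self.masked = frozenset(masked)
        self.parent = parent
        self.children = []

class History:
    def __init__(self):
        self.root = Step("start", ())
        self.current = self.root

    def apply(self, label, newly_masked):
        """Record an operation as a new child of the current step."""
        step = Step(label, self.current.masked | set(newly_masked), self.current)
        self.current.children.append(step)
        self.current = step
        return step

    def undo(self):
        """Step back to the parent; the undone branch stays in the tree."""
        if self.current.parent is not None:
            self.current = self.current.parent
        return self.current

    def jump(self, step):
        """Return to any earlier step, on any branch."""
        self.current = step
        return step

h = History()
a = h.apply("mask marker m3", {"m3"})
h.undo()
b = h.apply("mask individual i7", {"i7"})
print(len(h.root.children))      # 2: both experiments are kept as branches
print(sorted(h.jump(a).masked))  # ['m3']
```

Keeping every branch lets a user try one cleaning hypothesis, back out, try another, and still return to the first.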
  17. Final Interface
  18. Experiment
      • Genotype Checker vs VIPER+ interfaces
      • Both run using the same underlying data checking algorithm
      • Same dataset
      • 11 Biologists/Geneticists/Bioinformaticians at The Roslin Institute
      • Asked them to attempt a pair of representative tasks with both interfaces (split into 12 questions)
  19. Experiment - Objective
      • Over the whole question set there was no objective difference, but one did emerge when we considered questions that involved pedigree exploration
      [Chart: Genotype Checker vs VIPER; horizontal axis 1 to 11, vertical axis 0 to 12]
  20. Experiment - Objective
      • Over the whole question set there was no objective difference, but one did emerge when we considered questions that involved pedigree exploration
      [Chart: Genotype Checker vs VIPER; horizontal axis 1 to 11, vertical axis 0 to 8]
  21. Experiment - Subjective

      | Question | 1 (VP) | 2 | 3 (No Pref) | 4 | 5 (GC) |
      |---|---|---|---|---|---|
      | Finding structural information on a pedigree | **7** | 1 | 2 | 1 | 0 |
      | Finding descendants of an individual | **8** | 2 | 0 | 1 | 0 |
      | Finding ancestors of an individual | **7** | 3 | 1 | 0 | 0 |
      | Finding error information on a single individual | 4 | 1 | **1** | 4 | 1 |
      | Finding error information on a single marker | 3 | **3** | 2 | 3 | 0 |
      | Distinguishing between different types of error | **7** | 2 | 2 | 0 | 0 |
      | Tracing errors to a shared parent | **8** | 0 | 2 | 1 | 0 |
      | Finding error information on a single family | **7** | 1 | 2 | 1 | 0 |
      | Comparing errors between related families (one shared parent) | **8** | 1 | 1 | 1 | 0 |
      | Masking errors | 1 | 2 | **4** | 3 | 1 |
      | Overall understanding of errors | 5 | **1** | 4 | 1 | 0 |
      | Overall ease of use | 5 | **2** | 3 | 0 | 1 |

      Key: 1 = Strongly prefer VIPER (VP), 5 = Strongly prefer Genotype Checker (GC), Bold = Median
  22. Experiment - Observations
      • A lot of incorrect/skipped answers in both scenarios
        – GC 61/132 = 46%
        – VP 45/132 = 34%
      • These users were occasional users of cleaning software, but it does show that Pedigree Cleaning is hard
      • Excelitis
        – Biologists love Excel. The first move of many was to investigate the tables of error info rather than the main pedigree visualisation
  23. End
      • Thanks for listening
      • Sponsored by BBSRC
      • http://www.bioinformatics.roslin.ed.ac.uk/viper/
