Final Viper Prototype Presentation

Presentation describing final VIPER prototype for cleaning animal pedigree datasets

  1. 1. VISUALISING ERRORS IN ANIMAL PEDIGREE GENOTYPE DATA. Martin Graham, Jessie Kennedy, Trevor Paterson & Andy Law. Edinburgh Napier University & The Roslin Institute, Univ of Edinburgh, UK
  2. 2. 2 years ago at Firbush... I said: “Aim is to develop interactive tools to locate and isolate errors in pedigree genotype data in their datasets”, where a Pedigree = family tree of related animals and a Genotype = genetic makeup of an organism.
  3. 3. Inheritance Basics (Very): Humans have DNA. They in fact have 2 lots of DNA (diploidy), which may or may not match at certain points. Two lots of DNA bundled in a chromosome. When two parents produce offspring, one lot of DNA is passed on to the child from each parent. Which lot is used changes, just to shuffle things up a bit more.
  4. 4. Inheritance Basics (Very): By looking at many, many Single Nucleotide Polymorphism (SNP) markers (points where we know things vary between individuals at the level of single DNA letters) we can check for errors. If one letter from each parent at these points turns up in the same place in the child’s DNA, everything is good. [Figure: example genotypes A G, A C and A C]
  5. 5. Errorz: But inevitably.... Errors creep in for various reasons: bad record-keeping, observations... Muddled DNA sampling, animals “jumping the fence” etc etc. Unusable data in this state. [Figure: three example errors. Nothing inherited from mum: A G, C C, C C. Nothing inherited from dad: A G, C A, G G. Novel allele, no inheritance from one parent, but we can’t tell which: A G, C A, T A]
  6. 6. Thus: There is a constant need to clean up pedigree data. Roslin have a tool that views data as a table (markers by individuals), so pedigree-based error patterns, such as the wrong dad for an entire set of offspring, were very hard to spot. So they wanted a new tool, with a funky
  7. 7. Layouts: So (2 years ago) we looked at pedigree layouts. And they were all rubbish.
  8. 8. Layouts: They didn’t scale, relationships became intractable to follow, they couldn’t resolve generations, and they often gave only individual-out views rather than the whole pedigree, etc.
  9. 9. Layouts: So we developed what we called the sandwich view. Between neighbouring generations, we draw: Dads as the top slice of bread; Mums as the bottom slice of bread; Kids as the filling; Errors colour-coded across the marker set, more
  10. 10. Layouts: Each family forms a block between the respective mum and dad, making it easy to see who is whose offspring/parents. The layout works because males mate with multiple females in each generation, but the opposite is rare.
  11. 11. Layouts: Each child forms a glyph used to show error, divided into three parts: Up triangle coloured if error with dad; Down triangle coloured if error with mum; Middle band coloured if error, but parent in error is unknown (novel allele). Lo, pedigree-based error patterns revealed themselves.
  12. 12. Layouts: Tables full of data, and histograms showing error distribution by marker and by individual, also help.
  13. 13. Cleaning: So, we can show errors nicely. But the aim is to get rid of all these errors. Masking is when we pretend we don’t know the values for particular markers / individuals / combinations thereof. What happens then is that those values are inferred from the corresponding values in the parents. [Figure: example genotypes, with a masked value shown as ? ?: A G, G C, A G, C C, C C, ? ?, C C, C C]
  14. 14. Cleaning: The visualisation lets the biologist mask individuals / bunches of markers / individual genotype points / relationships. These are then shown in blue in the interface.
  15. 15. Cleaning: This last point’s important as pedigree errors just propagate down the pedigree. A wrong parent for a child can’t be cured by hiding the child. It’s also why we can’t clean these data sets automatically, the biologist’s judgement in what
  16. 16. The Goal: Eventually we want a display with no nasty red colours, and then we can save it as a “clean” data set. Though obviously with lots of missing data. But the biologists say their tools can handle missing things, whereas wrong things blow them up. And we did have to stick in a final “auto clean up” button to fix sporadic errors that would have taken ages to fix manually. But the major systematic errors are fixed by the biologist.
  17. 17. User Test: We did a user test with 11 biologists at Roslin. They preferred the new tool to the table-like tool. Probably the most interesting thing past the numbers was once again how much a bunch of scientists are in thrall to Excel. Just like the taxonomists we’ve worked with / social scientists we’re writing a proposal with. Which is why the Roslin guys made a table-a-like tool in the first place, to try and appease them.
  18. 18. Conclusion: Built successful tool (got it published in EuroVis, BioVis and AVI). Whether it’s successful from the biologists’ point of view... During the project, marker set sizes jumped from thousands to hundreds of thousands. Sequencing the data used to be the costly part of the process; staff time to clean it up was relatively cheap. Biology in general is having a data crisis; some opinions say it’s cheaper/easier to redo experiments than store the TBs of information.
  19. 19. Conclusion: Available at www.viper-project.org. Did do JavaDocs this time. I enjoyed it.
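
The inheritance check from slides 4 and 5, the three-part glyph from slide 11 and the masking from slide 13 can be sketched roughly as below. This is only an illustration in Python, not the VIPER code itself (the slides mention JavaDocs, so the real tool is presumably Java). The names `classify_trio` and `mask`, the result labels, and the use of `None` for a masked genotype are all assumptions made for the sketch. The way the "unknown parent" case is decided is also an assumption, chosen so that it agrees with the three example errors on slide 5.

```python
from typing import Dict, Optional, Tuple

# A genotype is a pair of single-letter alleles; None stands for a masked value.
# (Hypothetical representation, not taken from the VIPER code.)
Genotype = Optional[Tuple[str, str]]


def classify_trio(dad: Genotype, mum: Genotype, child: Genotype) -> str:
    """Return 'ok', 'dad', 'mum' or 'unknown' for one marker in one family trio."""
    # A masked genotype cannot contradict anything, so nothing is flagged.
    if dad is None or mum is None or child is None:
        return "ok"
    a, b = child
    # Consistent if one child allele can come from dad and the other from mum.
    if (a in dad and b in mum) or (b in dad and a in mum):
        return "ok"
    shares_dad = a in dad or b in dad
    shares_mum = a in mum or b in mum
    if shares_mum and not shares_dad:
        return "dad"      # nothing inherited from dad (up triangle)
    if shares_dad and not shares_mum:
        return "mum"      # nothing inherited from mum (down triangle)
    return "unknown"      # e.g. novel allele, parent in error unclear (middle band)


def mask(genotypes: Dict[str, Dict[str, Genotype]],
         individual: str, marker: str) -> Dict[str, Dict[str, Genotype]]:
    """Return a copy of the table with one individual/marker value masked."""
    masked = {name: dict(markers) for name, markers in genotypes.items()}
    masked[individual][marker] = None
    return masked
```

Reading the slide 5 figure as mum, dad, child (an assumption about the figure layout), `classify_trio(dad=("C", "C"), mum=("A", "G"), child=("C", "C"))` gives `"mum"`, `classify_trio(dad=("C", "A"), mum=("A", "G"), child=("G", "G"))` gives `"dad"`, and `classify_trio(dad=("C", "A"), mum=("A", "G"), child=("T", "A"))` gives `"unknown"`. Masking only hides a value. It does not repair a wrong parent link, which is the point made on slide 15.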