
Final Viper Prototype Presentation


Presentation describing final VIPER prototype for cleaning animal pedigree datasets


  1. 1. VISUALISING ERRORS IN ANIMAL PEDIGREE GENOTYPE DATA. Martin Graham, Jessie Kennedy, Trevor Paterson & Andy Law. Edinburgh Napier University & The Roslin Institute, Univ of Edinburgh, UK
  2. 2. 2 years ago at Firbush... I said: “Aim is to develop interactive tools to locate and isolate errors in pedigree genotype data in their datasets”, where a Pedigree = family tree of related animals, and a Genotype = genetic makeup of an organism.
  3. 3. Inheritance Basics (Very): Humans have DNA. They in fact have 2 lots of DNA (diploidy), which may or may not match at certain points. Two lots of DNA bundled in a chromosome. When two parents produce offspring, one lot of DNA is passed on to the child from each parent. Which lot is used changes, just to shuffle things up a bit more.
  4. 4. Inheritance Basics (Very): By looking at many, many Single Nucleotide Polymorphism (SNP) markers (points where we know things vary between individuals at the level of single DNA letters) we can check for errors. (Figure letters: A G A C A C.) If one letter from each parent at these points turns up in the same place in the child’s DNA, everything is good.
  5. 5. Errorz: But inevitably... Errors creep in for various reasons: bad record-keeping, observations... Muddled DNA sampling, animals “jumping the fence” etc etc. Unusable data in this state. Figure examples: nothing inherited from mum (A G C C C C); nothing inherited from dad (A G C A G G); novel allele, i.e. no inheritance from one parent, but we can’t tell which (A G C A T A).
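
The inheritance check and the three error types on the last two slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not VIPER's actual code: the function name `classify_marker`, the label strings and the tuple representation are all made up here, and the rule that any allele found in neither parent counts as "novel" is a simplification. The slides do not say which letters in the figures belong to which animal, so the parents in the usage lines are assigned by assumption.

```python
from typing import Tuple

# Two alleles at one SNP marker, e.g. ("A", "G"). Hypothetical representation.
Genotype = Tuple[str, str]


def classify_marker(dad: Genotype, mum: Genotype, child: Genotype) -> str:
    """Return 'ok', 'dad', 'mum' or 'novel' for one marker in one child."""
    a, b = child
    # Consistent: one child allele can come from dad and the other from mum.
    if (a in dad and b in mum) or (b in dad and a in mum):
        return "ok"
    # An allele seen in neither parent is flagged as novel (a simplification).
    if any(x not in dad and x not in mum for x in child):
        return "novel"
    # Every child allele occurs in some parent but no valid split exists,
    # so both alleles trace to the same parent and the other parent gave nothing.
    if not any(x in dad for x in child):
        return "dad"
    return "mum"


# Parent roles below are assumed for illustration.
print(classify_marker(dad=("A", "G"), mum=("A", "C"), child=("A", "C")))
print(classify_marker(dad=("C", "C"), mum=("A", "G"), child=("C", "C")))
print(classify_marker(dad=("A", "G"), mum=("C", "A"), child=("T", "A")))
```

Returning a label per marker, rather than a plain true/false, keeps the "which parent is in error" information that the later slides rely on.
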
  6. 6. Thus: There is a constant need to clean up pedigree data. Roslin have a tool that views data as a table (markers by individuals), so pedigree-based patterns of error, such as the wrong dad for an entire set of offspring, were very hard to spot. So they wanted a new tool, with a funky
  7. 7. Layouts: So (2 years ago) we looked at pedigree layouts. And they were all rubbish.
  8. 8. Layouts: They didn’t scale, relationships became intractable to follow, generations couldn’t be resolved, and they often gave only individual-out views rather than the whole pedigree, etc.
  9. 9. Layouts: So we developed what we called the sandwich view. Between neighbouring generations, we draw: dads as the top slice of bread; mums as the bottom slice of bread; kids as the filling; errors colour-coded across the marker set, more
  10. 10. Layouts: Each family forms a block between the respective mum and dad, making it easy to see who is whose offspring/parents. The layout works because males mate with multiple females in each generation, but the opposite is rare.
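
The family blocks of the sandwich view amount to grouping kids by their (dad, mum) pair. A rough sketch of that grouping step, assuming each record is a hypothetical `(child, dad, mum)` tuple of ID strings; the function name `family_blocks` is invented here, and the slides do not describe how VIPER actually orders or draws the blocks.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def family_blocks(
    records: Iterable[Tuple[str, str, str]]
) -> List[Tuple[str, str, List[str]]]:
    """Group (child, dad, mum) records into one block per (dad, mum) pair."""
    families = defaultdict(list)
    for child, dad, mum in records:
        families[(dad, mum)].append(child)
    # Sorting by dad first keeps all of one dad's families next to each other,
    # which suits pedigrees where a male has offspring with several females.
    return sorted((dad, mum, kids) for (dad, mum), kids in families.items())


records = [
    ("k1", "d1", "m1"),
    ("k2", "d1", "m2"),
    ("k3", "d1", "m1"),
    ("k4", "d2", "m3"),
]
for dad, mum, kids in family_blocks(records):
    print(dad, mum, kids)
```
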
  11. 11. Layouts: Each child forms a glyph used to show error. It is divided into three parts: up triangle coloured if error with dad; down triangle coloured if error with mum; middle band coloured if error, but parent in error is unknown (novel allele). Lo, pedigree-based error patterns revealed themselves.
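
One way to picture the three-part glyph is as three numbers per child, one per glyph part. The sketch below assumes each marker has already been given a label of `"ok"`, `"dad"`, `"mum"` or `"novel"` (hypothetical labels), and it assumes colour strength is driven by the fraction of markers in error. The slide only says the parts are coloured if there is an error, so the fraction is an illustration choice, as are the names `glyph_parts`, `up_triangle`, `down_triangle` and `middle_band`.

```python
from collections import Counter
from typing import Dict, Iterable


def glyph_parts(labels: Iterable[str]) -> Dict[str, float]:
    """Summarise one child's per-marker labels as a fraction per glyph part."""
    labels = list(labels)
    counts = Counter(labels)
    total = len(labels) or 1  # avoid dividing by zero for a child with no markers
    return {
        "up_triangle": counts["dad"] / total,    # error with dad
        "down_triangle": counts["mum"] / total,  # error with mum
        "middle_band": counts["novel"] / total,  # parent in error unknown
    }


print(glyph_parts(["ok", "dad", "dad", "novel"]))
```
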
  12. 12. Layouts: Tables full of data, and histograms showing the error distribution by marker and by individual, also help.
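
The data behind such histograms is just two tallies. A minimal sketch, assuming errors arrive as hypothetical `(individual, marker)` pairs; the function name `error_histograms` is invented, and the tool's real data model is not given in the slides.

```python
from collections import Counter
from typing import Iterable, Tuple


def error_histograms(
    errors: Iterable[Tuple[str, str]]
) -> Tuple[Counter, Counter]:
    """Count errors per marker and per individual from (individual, marker) pairs."""
    errors = list(errors)  # allow a generator to be read twice
    by_marker = Counter(marker for _, marker in errors)
    by_individual = Counter(individual for individual, _ in errors)
    return by_marker, by_individual


by_marker, by_individual = error_histograms(
    [("k1", "snp7"), ("k2", "snp7"), ("k1", "snp9")]
)
print(by_marker.most_common())
print(by_individual.most_common())
```

A marker with a tall bar suggests a bad marker; an individual with a tall bar suggests a bad sample or a wrong parent.
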
  13. 13. Cleaning: So, we can show errors nicely. But the aim is to get rid of all these errors. Masking is when we pretend we don’t know the values for particular markers / individuals / combinations thereof. What happens then is that those values are inferred from the corresponding values in the parents. (Figure letters: A G G C A G C C C C ? ? C C C C.)
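
A small sketch of what inferring a masked value from the parents could look like. The slides do not say how the inference is really done, so this simply lists every genotype the two parents could produce and only fills in the masked value when exactly one is possible. The names `possible_child_genotypes` and `infer_masked`, and the use of `"?"` for a masked allele, are assumptions for illustration.

```python
from typing import Optional, Set, Tuple

Genotype = Tuple[str, str]  # hypothetical representation; "?" means masked


def possible_child_genotypes(dad: Genotype, mum: Genotype) -> Set[Genotype]:
    """All unordered genotypes made from one dad allele and one mum allele."""
    return {tuple(sorted((x, y))) for x in dad for y in mum}


def infer_masked(dad: Genotype, mum: Genotype) -> Optional[Genotype]:
    """Return the child's genotype if the parents leave only one option."""
    if "?" in dad or "?" in mum:
        return None  # a masked parent tells us nothing here
    options = possible_child_genotypes(dad, mum)
    return next(iter(options)) if len(options) == 1 else None


print(infer_masked(("C", "C"), ("C", "C")))  # both parents homozygous
print(infer_masked(("A", "G"), ("C", "C")))  # ambiguous, stays unknown
```
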
  14. 14. Cleaning: The visualisation lets the biologist mask individuals / bunches of markers / individual genotype points / relationships. These are then shown in blue in the interface.
  15. 15. Cleaning: This last point’s important, as pedigree errors just propagate down the pedigree. A wrong parent for a child can’t be cured by hiding the child. It’s also why we can’t clean these data sets automatically, the biologist’s judgement in what
  16. 16. The Goal: Eventually we want a display with no nasty red colours, and then we can save it as a “clean” data set. Though obviously with lots of missing data. The biologists say their tools can handle missing things, but wrong things blow them up. And we did have to stick in a final “auto clean up” button to fix sporadic errors that would have taken ages to fix manually. But the major systematic errors are fixed by the biologist.
  17. 17. User Test: We did a user test with 11 biologists at Roslin. They preferred the new tool to the table-like tool. Probably the most interesting thing past the numbers was, once again, how much a bunch of scientists are in thrall to Excel. Just like the taxonomists we’ve worked with and the social scientists we’re writing a proposal with. Which is why the Roslin guys made a table-a-like tool in the first place, to try and appease them.
  18. 18. Conclusion: Built a successful tool (got it published in EuroVis, BioVis and AVI). Whether it’s successful from the biologists’ point of view... During the project, marker set sizes jumped from thousands to hundreds of thousands. Sequencing the data used to be the costly part of the process; staff time to clean it up was relatively cheap. Biology in general is having a data crisis; some opinions say it’s cheaper/easier to redo experiments than store the TBs of information.
  19. 19. Conclusion: Available at [URL missing]. Did do JavaDocs this time. I enjoyed it.