Building a Mutation History Tree

Maurice Gleeson's presentation on supercharging your project members from the 2015 International Conference on Genetic Genealogy.

Published in: Science

  1. 1. Combining SNPs, STRs, & Genealogy to build a Surname Origins Tree Dr Maurice Gleeson 11th Annual FTDNA Conference 15th Nov 2015 http://gleesondna.blogspot.co.uk/ YouTube – DNA and Family History Research
  2. 2. Google: YouTube Genetic Genealogy Ireland
  3. 3. A Combined Mutation / Family History Tree … using DNA markers when people run out … is it possible? Can you do it?
  4. 4. Topics for Discussion • Building a tree with STRs • Building a tree with SNPs • Combining STRs & SNPs • Dating branching points in the tree • Combining STRs, SNPs & genealogy • Opportunities for the years ahead
  5. 5. Topics for Discussion • Building a tree with STRs • Building a tree with SNPs • Combining STRs & SNPs • Dating branching points in the tree • Combining STRs, SNPs & genealogy • Challenges for the years ahead
  6. 6. Modal Haplotype for Lineage II • Lots of Parallel Mutations! o Back Mutations remain hidden • Is resolution enough to define the tree? • Is this the “best fit” model? 570 (17-18) CDYa (38>39) CDYa (38>39) 3 Branch numbers
  7. 7. Courtesy of Ralph Taylor G64 G39 Fluxus cladogram • It can help - useful to check against the Hand-Drawn Tree • Shows “maximum parsimony” version • Cumbersome, fiddly, easy to make mistakes, difficult to interpret, time-consuming • Difficult to visualise as a “Family Tree” • Gives all markers equal weight & ignores differing mutation rates www.isogg.org/wiki/Cladogram
  8. 8. Courtesy of Ralph Taylor G64 G39 Fluxus cladogram • Several “Best Fit” models - at least 8 BF models … - Tree is not anchored • No single “most likely” option • So not enough information at 37 markers to define the branching pattern • Parallel Mutations still persist - 390, 392, CDYa&b • Back Mutations also possible • Not clear which mutation came before which www.isogg.org/wiki/Cladogram
  9. 9. 570 (17-18) CDYa (38>39) CDYa (38>39) Hand Drawn Tree 570 (17-18) CDYa (38>39) CDYa (38>39) Fluxus Tree v1 Branch numbers
  10. 10. 570 (17-18) CDYa (38>39) CDYa (38>39) Hand Drawn Tree 570 (17-18) CDYa (38>39) CDYa (38>39) Fluxus Tree v1 Branch numbers
  11. 11. Fluxus Cladogram (111 markers) G64 G39 G73 G64 G39 Fluxus Cladogram (37 markers) www.isogg.org/wiki/Cladogram Courtesy of Ralph Taylor
  12. 12. Essential technology for project success
  13. 13. (37 markers)
  14. 14. Fluxus Cladogram (111 markers) G64 G39 G73 G64 G39 Fluxus Cladogram (37 markers) Courtesy of Ralph Taylor • No weighting … but mutation rates vary by a factor of 400 • James Irvine developed an algorithm for weighting markers: weighting = 99 × (1 – mutation rate / 0.04)² https://en.wikipedia.org/wiki/List_of_Y-STR_markers
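The weighting formula on slide 14 can be tried out numerically. What follows is a minimal Python sketch, not James Irvine's own implementation: it assumes the trailing "2" in the formula is an exponent and that mutation rates are per generation. The function name `irvine_weight` and its parameter names are hypothetical; the two sample rates are the CDY and 464 rates quoted on slide 19.

```python
def irvine_weight(mutation_rate, max_rate=0.04, scale=99):
    """Weight a marker so that slow-mutating markers count for more.

    Follows weighting = 99 * (1 - mutation_rate / 0.04) ** 2 as quoted on
    the slide; reading the trailing "2" as a square is an assumption.
    """
    return scale * (1 - mutation_rate / max_rate) ** 2


# Per-generation mutation rates quoted on slide 19 of the deck.
rates = {"CDY": 0.03531, "DYS464": 0.00566}

for marker, rate in rates.items():
    print(f"{marker}: rate={rate:.5f} weight={irvine_weight(rate):.1f}")
```

Under this reading a marker with a rate near 0.04 gets a weight close to 0 and a very slow marker gets a weight close to 99. Because of the square, a rate above 0.04 would push the weight back up, so the formula presumably expects no marker to exceed that rate.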
  15. 15. www.isogg.org/wiki/Cladogram Courtesy of Ralph Taylor • Torso disappears • No alternative pathways = 1 single “Best Fit” model Fluxus Cladogram (111 markers) G64 G39 G73 Fluxus Cladogram (111 markers, weighted)
  16. 16. Some markers behave unusually • Marker 389: this is tested in 2 parts – mutation in Part 1 is also counted in Part 2 => so just use Part 2 (389ii) … and we did! – www.familytreedna.com/learn/y-dna-testing/y-str/different-str-markers-dys389i-dys398ii-dys389-2-result-family-tree-dna-different-genographic-project/ • Multi-copy markers 464abcd (but also 385, 459, YCAII, CDY, DYF395S1, 413) – mutations in multi-copy markers may not be in the correct order – Kittler test defines relative positions for 385 … not applicable here? – www.familytreedna.com/learn/y-dna-testing/y-str/infinite-allele-palindromic-markers/ – http://www.isogg.org/wiki/DYS_464 • Multi-copy marker 464abcd: 2 types = c & g – 464x test defines which type (but not position) … not accounted for! – http://www.dna-fingerprint.com/static/PalindromicPres.pdf • 464abcd, CDYa & b: fast-mutating palindromic markers – http://www.isogg.org/wiki/RecLOH
  17. 17. Fluxus Cladogram (111 markers, weighted) Fluxus Cladogram (111 markers, weighted, no CDY,464)
  18. 18. Which is more accurate? with or without CDY & 464? or some version in between?
  19. 19. How likely is it that 464 & CDY will screw things up? • Gleeson surname origin = 1000 AD => Surname has had 1000 years to mutate = 33.3 generations (30 y/gen) • How many mutations would you expect in 1000 years? • CDY mutation rate = 0.03531 / gen = 1.176 per member = c.16 mutations for all 14 branches of Lineage II Observed rate is 4 for CDYa, and 3 for CDYb => 12/16 and 13/16 mutations respectively are hidden? – So predictions based on CDY will be incorrect (12/16 + 13/16)/2 = 78% of the time? • 464 mutation rate = 0.00566 / gen = 0.188 per member = 2.6 per 14 members (on each of 464abcd) Observed rate is 0 for 464a & d, and 2 for 464b & c => 2.6/2.6 & 0.6/2.6 mutations respectively are hidden? – So predictions based on 464 will be incorrect 62% of the time? https://en.wikipedia.org/wiki/List_of_Y-STR_markers
  20. 20. How likely is it that 464 & CDY will screw things up? • Less of a problem in those branches related within the last 200-300 years? – less time to mutate back – lower chance of back mutations – more useful for branch-defining • More of a problem with those branches more distantly related (600-1000 yrs)? – more time to mutate back – higher chance of back mutations – less useful for branch-defining => Choose v3a (i.e. use CDY & 464 data) • Tree will be less than 100% correct • Be especially wary of mutations in more distant reaches of the tree https://en.wikipedia.org/wiki/List_of_Y-STR_markers
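The arithmetic on slide 19 can be retraced step by step. This is a rough Python sketch under the slide's own assumptions (1000 years since surname origin, 30 years per generation, 14 branches in Lineage II); the function names `expected_mutations` and `mean_hidden_share` are hypothetical and only illustrate the calculation.

```python
YEARS_SINCE_ORIGIN = 1000
YEARS_PER_GENERATION = 30
BRANCHES = 14


def expected_mutations(rate_per_generation):
    """Return (expected mutations per member, expected total for all branches)."""
    generations = YEARS_SINCE_ORIGIN / YEARS_PER_GENERATION  # about 33.3
    per_member = rate_per_generation * generations
    return per_member, per_member * BRANCHES


def mean_hidden_share(expected, observed_counts):
    """Average share of expected mutations that were not observed."""
    shares = [(expected - seen) / expected for seen in observed_counts]
    return sum(shares) / len(shares)


cdy_per_member, cdy_total = expected_mutations(0.03531)
d464_per_member, d464_total = expected_mutations(0.00566)
print(f"CDY: {cdy_per_member:.2f} per member, {cdy_total:.1f} in total")
print(f"464: {d464_per_member:.2f} per member, {d464_total:.1f} in total")

# Using the slide's rounded expectations (16 and 2.6) and observed counts.
print(f"CDY hidden share: {mean_hidden_share(16, [4, 3]):.0%}")
print(f"464 hidden share: {mean_hidden_share(2.6, [0, 2, 2, 0]):.0%}")
```

With the slide's rounded expectations this gives about 78% for CDY and about 62% for 464, matching the figures quoted. Using the unrounded CDY total (roughly 16.5) shifts the CDY share up by about a percentage point.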
  21. 21. Y-12 HDT Y-37 HDT
  22. 22. Caveats & Limitations • Missing data – Fluxus fills in the blanks - is its “best guess” valid? – No adequate mutation rates for many markers • The Tree is not yet “anchored” – More so in the upper reaches of the tree (sub-branches seem stable) – Several interpretations are still possible, even at 111 markers (v3a vs v4) – Will this reduce as more people test? or upgrade? – Are there hidden Back Mutations? • Tree may be skewed by recent mutations (last 5-6 generations) => Triangulate on each MDKA – Test at least 2 known distant cousins from each family branch in order to characterise the haplotype of each MDKA – Helps eliminate recent mutations which might cloud the interpretation – Costly … $339 for a 111 marker test … x2 = $678 • Is there Convergence in the Tree? (e.g. 3/111) www.isogg.org/wiki/Fluxus
  23. 23. Topics for Discussion • Brief overview of key concepts • Building a tree with STRs • Building a tree with SNPs • Combining STRs & SNPs • Dating branching points in the tree • Combining STRs, SNPs & genealogy • Challenges for the years ahead
  24. 24. http://dna-explained.com/2014/10/15/tenth-annual-family-tree-dna-conference-wrapup/ Deep Clade Panel 2.0 - Targeted subclade panels - $119
  25. 25. Is fine-scale SNP testing the best method of determining branching patterns within a Genetic Family? … how to do it as cheaply & efficiently as possible?
  26. 26. Google: YouTube Genetic Genealogy Ireland
  27. 27. Working with SNPs – Opportunities & Challenges • Declaring SNPs - false positives • Missing SNPs - false negatives • Constant change – “Known, Novel, Shared & Private” • No name, just a location • SNP naming process unregulated – Same SNP, different names • Making results user-friendly • Lots of help available – independent verification & interpretation possible
  28. 28. Problems encountered with “declaring a genuine SNP” (columns: Problem | Reason(s) | Implication) • Detection | No coverage | False Negative – SNP is present on Y but remains undetected • Low no. of Calls | Poor coverage | False Negative – SNP present but fails to meet threshold criteria • Recognition | Detection Filter / Threshold too strict? | False Negative – SNP is present in data but missed by analysis; detectable by manual analysis of possible SNPs on BAM file • Localisation | Difficult location on Y (centromere, palindrome, in STR / repetitive region) | False Positive or Negative – SNP may be genuine but its exact position cannot be known for sure or may vary • Instability | Unstable SNP – frequent & unpredictable mutation | False Positive or Negative – SNP may or may not be genuine • InDels | Not SNPs, but rather a deletion (usually) | False Positive or Negative – may or may not be genuine So is the SNP really present? … or absent? Just because it is detected, doesn’t mean it is there … Just because it’s not detected, doesn’t mean it isn’t there
  29. 29. SNPs Known SNPs (already discovered) New SNPs (never discovered before) Shared (with someone else) Not shared (Unique / Private) “Known, Novel, Shared & Private” – the fluid categorisation of SNPs
  30. 30. Shared Novel Variants No names … just positions
  31. 31. Private SNPs (unique) No names … just positions
  32. 32. FTDNA Results (FT) Project Admin (LL) Haplogroup Admins* Alex (Big Tree) Williamson Nigel (Munster) McCarthy YFULL (YF) 11 2 3 2 1 4 Shared Novel Variants in Z16437 subgroup * Neal Downing, John Murphy, James Kane & Z255 Yahoo group
  33. 33. Lisa Little, project member
  34. 34. Gleeson Family Tree based on newly discovered SNP markers Lisa Little, project member
  35. 35. Z255 Haplogroup Project Colour Coded Spreadsheet (John Murphy) Gleeson-specific SNP markers https://groups.yahoo.com/neo/groups/R1b-Z255-Project
  36. 36. James Kane’s tree www.it2kane.org/matrix/R.html https://www.familytreedna.com/groups/r-l21-south-irish/about/background
  37. 37. http://www.ytree.net Alex Williamson’s “Big Tree”
  38. 38. … aka BY2853 Jan 2015 Apr 2015 Jun 2015 Oct 2015 www.ytree.net/DisplayTree.php?blockID=319&star=false Clicking on a marker or name brings up further analysis
  39. 39. www.ytree.net/MutMatrix.php Grey = no coverage Pink = marginal coverage My simplistic interpretation + Definite * Probable ** Possible *** Unlikely The Big Tree: R-A5629 Mutation Matrix of Shared SNPs
  40. 40. Currently Unique SNPs … 3 (1), 3 (2), 13 (5) = 19 (8) http://www.ytree.net/SNPinfoForPerson.php?personID=1288 Alex Williamson’s “Big Tree”
  41. 41. YFULL Novel SNPs Alex Williamson’s “Big Tree” www.yfull.com
  42. 42. • Are they really SNPs? - different thresholds & filters • SNPs trapped in Private Collections - Private SNPs will be liberated as more people test & SNPs become “not private” anymore – move up into the shared area of the tree … but they will run out! When? • No names, just locations - will need to be translated into SNP names in time => consult Ybrowse, other utilities?? Inconsistency in “declaring a genuine SNP”
  43. 43. Different strokes for different folks Who is right? … or more accurately … who has estimated correctly? End Result SNP = definite, probable, possible, or unlikely … subject to change ... & Sanger Sequencing?
  44. 44. Despite NGS, Sanger Sequencing will still be required • Chip-based SNP testing will still be needed to confirm or refute discoveries made by NGS • Multiple Deep Clade Panels will need to be created … for subclades, surnames, & genetic clusters Some Bold Predictions …
  45. 45. Topics for Discussion • Brief overview of key concepts • Building a tree with STRs • Building a tree with SNPs • Combining STRs & SNPs • Dating branching points in the tree • Combining STRs, SNPs & genealogy • Challenges for the years ahead
  46. 46. • SNP results consistent? • Need to tidy it up 456 15-16
  47. 47. • SNPs are further up the tree than STRs • Tell us nothing about branches on left • Only use “definite SNPs” (not probable/possible) • Private SNPs are still trapped in Private Collections Mutation sequence? BY2853 > A5629 > 456 … > G68 (Glisson, Branch 14) > A5628 > Y16880 (Branch 2,7,6) > A660 (Branch 9)
  48. 48. http://freepages.genealogy.rootsweb.ancestry.com/~skibbgirl/McCarthyDNAProject/ G54 G39 G51 G66 G22 G42 G55 G57 G21 Nigel McCarthy
  49. 49. G54 G39 G51 G73 G66 G22 G42 G55 G57 G21 Nigel McCarthy’s Z255 Group E http://freepages.genealogy.rootsweb.ancestry.com/~skibbgirl/McCarthyDNAProject/ G68 No BY2852 block Extra marker Private SNPs Private SNPs Private SNPs 2 pink SNPs omitted Differing Modal Haplotype <67 markers excluded
  50. 50. Topics for Discussion • Brief overview of key concepts • Building a tree with STRs • Building a tree with SNPs • Combining STRs & SNPs • Dating branching points in the tree • Combining STRs, SNPs & genealogy • Challenges for the years ahead
  51. 51. Iain McDonald, The 2015 report to the U106 group (Sep 2015) www.jb.man.ac.uk/~mcdonald/genetics/u106-geography-2015-revised.pdf
  52. 52. www.familytreedna.com/groups/tmrca-case-studies/about Up till now, we know there are branches that come off the Modal But which came first? Can we place them in the correct order?
  53. 53. G57, 60393 G21, N74958 G55, 338070 G39, N101540 G51, 244645 • YFULL analysis offers TMRCA estimates for SNPs … and includes Calculation Formula -60% to +50%
  54. 54. 750 325 50
  55. 55. 0 3 10 (columns: Markers tested | GD | Probability 5% | Probability 50% (MLE) | Probability 95% | Range (%)) • 12 | 1 | 3 | 17 | >24 | -82% to ??? • 25 | 1 | 1 | 7 | 20 | -85% to +186% • 37 | 1 | 0 | 3 | 10 | -100% to +233% • 67 | 2 | 1 | 4 | 11 | -75% to +175% • 111 | 6 | 4 | 8 | 15 | -50% to +88% • 495 | 24 | 6 | 9 | 12 | -33% to +33% G21 G57 MLE, Maximum Likelihood Estimate(?) • Ranges are wide & skewed toward distant generations • 111 markers gives the “best estimate” with smallest upper ranges but still almost double the mid-value
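The "Range (%)" column on slide 55 appears to express the 5% and 95% estimates relative to the 50% (MLE) estimate. That reading is inferred from the numbers rather than stated on the slide. A small Python sketch under that assumption is given below; `relative_range` is a hypothetical helper name, and the 12-marker row is left out because its upper bound is only given as ">24".

```python
# (markers tested, GD, 5% estimate, 50% / MLE estimate, 95% estimate)
# Values copied from the slide; estimates are in generations.
rows = [
    (25, 1, 1, 7, 20),
    (37, 1, 0, 3, 10),
    (67, 2, 1, 4, 11),
    (111, 6, 4, 8, 15),
    (495, 24, 6, 9, 12),
]


def relative_range(low, mid, high):
    """Express the low and high estimates as percentages of the mid estimate."""
    return (low - mid) / mid * 100, (high - mid) / mid * 100


for markers, _gd, low, mid, high in rows:
    lower_pct, upper_pct = relative_range(low, mid, high)
    print(f"{markers:>3} markers: {lower_pct:+.0f}% to {upper_pct:+.0f}%")
```

The computed values agree with the slide to within rounding; the 25-marker lower bound comes out nearer -86% than the -85% shown, which looks like truncation rather than rounding.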
  56. 56. • Individually extracted 5%, 50% & 95% estimates (90% Confidence Interval) • Markers tested: White = 111, Yellow = 67, Cream = 37, Blue = 25 • 50% probability estimate ranges from 1 to >24 generations • Use triangulation to get better overall estimate? TMRCA Triangulation
  57. 57. 750 325 50 3 3 6,4,6 3,3 8,3,11 24,22,21,21*3, >24,18,15,20,22 9 12 11*3,1522,14,13*3,1 6 11 2 25 5.3 9.5 13 21 3 9.5 ? 14,14,11,11,22,22,17,17,18,18,15, 15,(20,13,14,14)*3,18,10,10,10 14,14,11,11,22,22,17,17,18,18,15, 15,(20,13,14,14)*3,18,10,10,1014.3 TMRCA Triangulation
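Several of the figures on the triangulation slides look consistent with simply averaging the individual 50% estimates (in generations) for the pairs that span a branching point. The Python sketch below illustrates that reading only; the grouping of the numbers is an assumption made from the flattened slide text, and `triangulate` is a hypothetical name.

```python
from statistics import mean


def triangulate(estimates):
    """Combine several pairwise TMRCA estimates (generations) into one mean."""
    return round(mean(estimates), 1)


# Groups of pairwise estimates as they appear on slides 57 and 61.
print(triangulate([6, 4, 6]))         # slide shows 5.3
print(triangulate([14, 16, 18, 18]))  # slide shows 16.5
print(triangulate([13, 10]))          # slide shows 11.5
```

A plain mean treats a 25-marker comparison the same as a 111-marker one, so weighting by the number of markers tested might be worth considering, although the slides do not say whether that was done.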
  58. 58. Will additional STR markers help refine TMRCA estimates? • But … 5% differ? ... some are missing? ... not detected by NGS? • 35 mutations between G21 & G55 • 24 mutations between G21 & G57 • 9 mutations between G21 & G57
  59. 59. http://dna-project.clan-donald-usa.org/tmrca.htm
  60. 60. 0 3 10 (columns: Markers tested | GD | Probability 5% | Probability 50% (MLE) | Probability 95% | Range (%)) • 12 | 1 | 3 | 17 | >24 | -82% to ??? • 25 | 1 | 1 | 7 | 20 | -85% to +186% • 37 | 1 | 0 | 3 | 10 | -100% to +233% • 67 | 2 | 1 | 4 | 11 | -75% to +175% • 111 | 6 | 4 | 8 | 15 | -50% to +88% • 495 | 24 | 6 | 9 | 12 | -33% to +33% (second table, same columns) • 12 | 1 | 3 | 17 | >24 | -82% to ??? • 25 | 1 | 1 | 7 | 20 | -85% to +186% • 37 | 1 | 0 | 3 | 10 | -100% to +233% • 67 | 2 | 1 | 4 | 11 | -75% to +175% • 111 | 6 | 4 | 8 | 15 | -50% to +88% • 495 | 24 | 6 | 9 | 12 | -33% to +33% G21 G57
  61. 61. 750 325 50 3 3 6,4,6 3,3 8,3,11 24,22,21,21*3, >24,18,15,20,22 9 12 11*3,1522,14,13*3,1 6 11 2 25 5.3 9.5 13 21 3 9.5 ?14,16,18,18 13,10 16.5 14.3 11.5 7
  62. 62. Topics for Discussion • Brief overview of key concepts • Building a tree with STRs • Building a tree with SNPs • Combining STRs & SNPs • Dating branching points in the tree • Combining STRs, SNPs & genealogy • Challenges for the years ahead
  63. 63. 750 325 50 3 3 6,4,6 3,3 8,3,11 24,22,21,21*3, >24,18,15,20,22 9 12 11*3,1522,14,13*3,1 6 11 2 25 5.3 9.5 13 21 3 9.5 ?14,16,18,18 13,10 16.5 14.3 11.5 ???? ???? 7
  64. 64. 750 325 50 3 3 6,4,6 3,3 8,3,11 24,22,21,21*3, >24,18,15,20,22 9 12 11*3,1522,14,13*3,1 6 11 2 25 5.3 9.5 13 21 3 9.5 ?14,16,18,18 13,10 16.5 14.3 11.5 ???? ???? 7 MDKA Profile
  65. 65. MDKA Profiles http://gleesondna.blogspot.com
  66. 66. A Combined Mutation / Family History Tree … using DNA markers when people run out … is it possible?
  67. 67. Topics for Discussion • Brief overview of key concepts • Building a tree with STRs • Building a tree with SNPs • Combining STRs & SNPs • Dating branching points in the tree • Combining STRs, SNPs & genealogy • Opportunities for the years ahead
  68. 68. Lessons Learned & Future Opportunities • Transcription errors are easy => triple-check, automate • Re STRs – Lots of Parallel Mutations … where are the Back Mutations? – 111 markers best define the branching pattern – Placement of CDY & 464 is likely to be incorrect (esp. in upstream generations) – Most project members have not tested other male cousins to triangulate on their MDKA – Convergence may be a problem (even at 3/111) – We need more people to test – We need more people to upgrade to 111 markers – YFULL analysis liberates 495 STRs
  69. 69. Lessons Learned & Future Opportunities • Re SNPs – Difficult to declare a genuine SNP – Different SNPs from different lips – Definite, probable, possible, unlikely – Likely to be lots of false negatives (& false positives) – No names (locations too long) – Naming is unregulated – Many SNPs trapped in Private Collections – Current NGS is discovery, not confirmatory => further testing (with other NGS?) needed to confirm
  70. 70. Lessons Learned & Future Opportunities • Re combining STRs & SNPs – Adding SNPs changed the upper reaches of the tree – SNPs are still located relatively upstream - STRs offer better definition downstream – Start with the Modal of your Haplogroup subgroup • Re TMRCA estimates – SNP-based estimates work best for distant branching points (haplogroup projects) – STR-based estimates have wide ranges, and are skewed toward distant generations – Even at 111, upper range ~ double the mid-value – Even 495 markers have a wide range (+/- 33%)
  71. 71. Lessons Learned & Future Opportunities • Re combining STRs, SNPs & genealogy – We need to overlay documentary data on DNA – Some pedigrees not supplied / incomplete – Need to add MPRs to all (MDKA Profile) – Need to take a One Name Study approach? • Collate all Gleeson data worldwide • Establish a relational database (Access?) • Assign data to different family branches • This early draft MHT serves as a useful basis – Will evolve over time as more people test & upgrade – Will facilitate collaboration between project members – Will help attract new project members
  72. 72. Vision 2020 Where will we be in 5 years’ time? Here are some bold predictions …
  73. 73. What would happen if … • Everyone upgraded to 111 markers? – Better definition of branching pattern – More precise TMRCA estimates (with narrower range) • Everyone did the Big Y? – SNPs only good for upstream branches? (<1500 AD) – We will run out of Private SNPs • Everyone tested on a Surname Specific Panel? – Would elucidate branching pattern up to 1500 AD? Later? • Everyone did Whole Genome Sequencing? – No better than Big Y? Better coverage? Better read length? – What will happen to Probable / Possible / Unlikely SNPs?
  74. 74. Some Bold Predictions … • (To help stimulate discussion & to learn) • What is most useful for Surname Projects – more SNPs or more STRs? – More STRs … we will run out of Private SNPs – 111 vs 50,000 – 500 vs 40? • In 2020, FTDNA will offer 500 STRs for $129
  75. 75. Some Bold Predictions … • How do we best generate a Surname-Specific SNP Panel? – Q: How many discovery Big Y tests are needed to liberate sufficient Private SNPs to adequately define the Surname Panel? – A: 5-10 Big Y tests per genetic cluster – We need another few people to Big Y test, then generate the Surname Panel for Lineage II • In 2020, FTDNA will offer over 4000 Surname Specific SNP Panels for $100 each
  76. 76. Generate MHTree More tools Lineage I Lineage II Lineage III Lineage IV Lineage II Mutation History Tree
  77. 77. Acknowledgements • Bennett Greenspan • Max Blankfield • Janine Cloud • FTDNA team • Judy Claassen • Lisa Little • James Irvine • Ralph Taylor • John Cleary • Haplogroup Admins • John Murphy • Neal Downing • James Kane • Alex Williamson • Nigel McCarthy • Dennis Wright • Alasdair MacDonald • YFULL team The Genetic Genealogy Community
