SyMAP Synteny Mapping and Analysis Program Austin Shoemaker
SyMAP Team Dr. Cari Soderlund Dr. Will Nelson Austin Shoemaker Interactive SyMAP views Sytry Testing environment for synteny finding algorithms Worked with the team on: The synteny finding algorithm MySQL database schema
Background Comparative Genomics Physical Map Computing Synteny Properties of FPC to Genome Synteny
Comparative Genomics Compare genomes of different species Knowledge of one helps understand the other Gene Function Organism O 1  has a gene G 1 Organism O 2  has a gene G 2  with a sequence similar to G 1 G 1  and G 2  may have similar functions Evolutionary History Genome rearrangements
Genome Rearrangements
Rearrangement    Scenario    Result
Inversion        A BCD E     A DCB E
Deletion         A B C       AC
Insertion        AB          A C B
Duplication      A B C       A B C B, A BB C
Further scenario/result text from the slide whose row assignment is not recoverable: A B CD / A B B CD; AB / AB AB
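The four rearrangement types can be sketched as operations on a string in which each letter stands for a genomic segment. This is only an illustration of the slide's examples; the function names and the index-based interface are assumptions made here, not part of SyMAP.

```python
def invert(genome: str, start: int, end: int) -> str:
    """Reverse the order of the segments in genome[start:end]."""
    return genome[:start] + genome[start:end][::-1] + genome[end:]


def delete(genome: str, start: int, end: int) -> str:
    """Remove the segments in genome[start:end]."""
    return genome[:start] + genome[end:]


def insert(genome: str, pos: int, segment: str) -> str:
    """Insert a new segment before index pos."""
    return genome[:pos] + segment + genome[pos:]


def duplicate(genome: str, start: int, end: int, pos: int) -> str:
    """Copy genome[start:end] and insert the copy before index pos."""
    return insert(genome, pos, genome[start:end])


print(invert("ABCDE", 1, 4))      # inversion of BCD
print(delete("ABC", 1, 2))        # deletion of B
print(insert("AB", 1, "C"))       # insertion of C
print(duplicate("ABC", 1, 2, 3))  # copy of B placed at the end
print(duplicate("ABC", 1, 2, 2))  # copy of B placed next to the original
```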
Whole-Genome Duplication [Figure: timeline in mya (million years ago) starting from the last common ancestor of rice and maize. The two lineages diverged 50-70 mya; duplication events are marked at 70 mya and 11 mya.]
Synteny At least two pairs of genes with similar structure and function on the same chromosome Order does not need to be conserved Often found using sequenced genomes We use a physical map and a genomic sequence [Figure: genes c, d, e, f, g on Genome A matched to genes c, d, e, f, g on Genome B]
Physical Map Expensive to sequence large genomes A physical map provides partial ordering of pieces of DNA and pieces of genes
FPC Map FingerPrinted Contigs Soderlund et al. 1997 Type of physical map Made up of clones Snippets of DNA We use BAC clones Bacterial artificial chromosome clones Stored in clone libraries
Making a BAC Clone Library Take thousands of copies of a genome Cut it up into overlapping pieces  (~150,000 base pairs) Restriction enzymes Proteins that cut at specific DNA sequences Partial digestion Restriction enzymes not allowed to cut at all possible locations so that the clones overlap
Clones Each clone is stored in a well on a microtiter plate Do not know the order of the clones, or where each clone is on the chromosome
Clone Fingerprinting Clone fingerprints are found to gather more information on a clone Fully digest a clone using restriction enzymes If two clones share many fragments, they may overlap
Clone Fingerprinting Fragments are run on a gel Shorter fragments migrate faster Measure migration rate False positives and false negatives
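The idea that two clones sharing many fragments may overlap can be sketched as a band-matching count. This is a deliberately simplified illustration: the function name, the tolerance value, and the example fragment sizes are assumptions made here, and FPC's actual overlap scoring is not reproduced. The tolerance stands in for measurement error on the gel, which is one source of the false positives and false negatives mentioned above.

```python
def shared_bands(fp_a, fp_b, tolerance=7):
    """Count fragments of two fingerprints that match within a tolerance.

    fp_a, fp_b: lists of measured fragment sizes (arbitrary units).
    Each fragment is matched at most once, using a two-pointer sweep
    over the sorted sizes.
    """
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


clone_1 = [100, 250, 400, 900]
clone_2 = [103, 405, 700, 905]
print(shared_bands(clone_1, clone_2))
```

A larger tolerance admits more chance matches (false positives); a smaller one misses true matches (false negatives).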
FPC Assembles fingerprinted clones into contigs Contig -> contiguous overlapping clones Assembles into many contigs instead of one large contig  Unclonable regions Uneven distribution
Markers Markers are pieces of DNA ~ 300 base pairs Hybridization A marker hybridizes to a clone when the clone contains the marker
BESs Expensive to sequence entire clones BAC End Sequences BESs are sequences from the ends of BAC clones ~800 base pairs Do not know which end the sequence comes from There are errors in the sequence
Anchors Locations in two genomes found to be similar through a comparison of DNA sequences We use marker sequences and BESs searched against a known genome sequence Maize has an FPC map with markers and BESs The rice genome is sequenced
G G C C G T G G T G C T C T T T G C A A T G G G
G G C T G T G G T G C T C T T C G C A A T G G G
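As a small illustration of what "similar through a comparison of DNA sequences" means, the sketch below compares the two 24-base sequences shown on the slide position by position. Splitting the slide's run of bases into two equal halves is an assumption made here, and real anchor finding relies on a sequence search tool rather than an ungapped comparison like this one.

```python
def mismatch_positions(seq_1: str, seq_2: str):
    """Return 0-based positions where two equal-length sequences differ."""
    if len(seq_1) != len(seq_2):
        raise ValueError("ungapped comparison needs equal-length sequences")
    return [i for i, (x, y) in enumerate(zip(seq_1, seq_2)) if x != y]


def identity(seq_1: str, seq_2: str) -> float:
    """Fraction of positions at which the two sequences agree."""
    return 1.0 - len(mismatch_positions(seq_1, seq_2)) / len(seq_1)


first = "GGCCGTGGTGCTCTTTGCAATGGG"
second = "GGCTGTGGTGCTCTTCGCAATGGG"
print(mismatch_positions(first, second))
print(round(identity(first, second), 3))
```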
Component Summary
Finding Chains
Key Synteny Finding Algorithms Vandepoele et al. (2002) – ADHoRe Variable gap size Coefficient of determination to determine the quality of a synteny block Haas et al. (2004) – DAGchainer Directed acyclic graph Dynamic programming Gap penalty
Other Synteny Finding Algorithms Key characteristics for us: Dynamic programming Ordering the anchors to form a DAG Gap penalty Variable gap size Not appropriate for finding synteny using an FPC map Do not consider the error conditions that arise
FPC to Genome Synteny Properties associated with FPC FPC maps do not cover the entire genome False+ and False- hybridized markers FPC coordinates are approximate Which end of the parent clone a BES belongs to is unknown
FPC Synteny Properties [Figure: anchors drawn between Genome A (FPC map) and Genome B (sequenced genome), with positions labeled 1-9 and a-c on each genome; individual marks are labeled x, o, and #.]
Noise
SyMAP Algorithm Anchor $(a_k, b_l)$: $a_k$ is the location on the FPC map of genome $G_A$, and $b_l$ is the location on the genomic sequence of $G_B$. Directed Acyclic Graph: $E = \{(u, v) : |a_k - a_i| \le M_A \text{ and } 0 \le b_l - b_j \le M_B\}$, where $u = (a_i, b_j)$ and $v = (a_k, b_l)$ are anchors. Allows edges decreasing along $G_A$, which catches off-diagonal anchors and some inversions.
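The edge rule can be sketched as a brute-force pass over all anchor pairs. This is a minimal illustration, not the SyMAP implementation: the function name, the tuple representation, and the example values are assumptions, and the lower bound on the genomic-sequence coordinate is taken as strict here (an assumption made so that anchors with equal b cannot form a cycle).

```python
from typing import List, Tuple

Anchor = Tuple[float, float]  # (a, b): FPC map location, sequence location


def build_edges(anchors: List[Anchor], max_a: float, max_b: float):
    """Return edges (i, k) meaning anchors[i] -> anchors[k].

    An edge requires the two anchors to be within max_a of each other
    on the FPC map (in either direction) and the target to lie ahead
    of the source, by at most max_b, on the genomic sequence.
    """
    edges = []
    for i, (a_i, b_j) in enumerate(anchors):
        for k, (a_k, b_l) in enumerate(anchors):
            if abs(a_k - a_i) <= max_a and 0 < b_l - b_j <= max_b:
                edges.append((i, k))
    return edges


anchors = [(10, 100), (12, 110), (11, 120), (50, 130)]
print(build_edges(anchors, max_a=5, max_b=25))
```

The edge from anchor 1 to anchor 2 decreases along the FPC map (12 to 11) yet is still allowed, which is what lets a chain absorb approximate FPC coordinates.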
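SyMAP Algorithm Manhattan distance function with scaling: $D(u, v) = |a_k - a_i| / t_A + |b_l - b_j| / t_B$, for anchors $u = (a_i, b_j)$ and $v = (a_k, b_l)$. The separate scale factors are needed because the average distance between anchors may be different along $G_A$ and $G_B$. Dynamic Programming: $\mathrm{Node}(v) = 1 + \max\bigl(0, \max_{u \in P(v)} (\mathrm{Node}(u) - D(u, v))\bigr)$, where $P(v)$ is the set of predecessors $u$ with $(u, v) \in E$. The 1 is the score given to an individual anchor; added to it is the maximum path score of a previous node, penalized by the distance between the nodes.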
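The recurrence can be sketched as follows. This is a hedged illustration rather than the SyMAP code: the function names, the quadratic predecessor scan, and the example parameter values are assumptions, and the predecessor test assumes a strict lower bound on the sequence coordinate so that processing anchors in order of b is a valid topological order.

```python
def distance(u, v, t_a, t_b):
    """Scaled Manhattan distance between anchors u and v."""
    return abs(v[0] - u[0]) / t_a + abs(v[1] - u[1]) / t_b


def score_anchors(anchors, max_a, max_b, t_a, t_b):
    """Return (score, back) dictionaries keyed by anchor index.

    score[v] = 1 + max(0, max over predecessors u of score[u] - D(u, v))
    back[v]  = the predecessor that achieved that maximum, or None.
    """
    order = sorted(range(len(anchors)), key=lambda i: anchors[i][1])
    score, back = {}, {}
    for pos, v in enumerate(order):
        best, best_u = 0.0, None
        for u in order[:pos]:
            d_a = abs(anchors[v][0] - anchors[u][0])
            d_b = anchors[v][1] - anchors[u][1]
            if d_a <= max_a and 0 < d_b <= max_b:
                candidate = score[u] - distance(anchors[u], anchors[v], t_a, t_b)
                if candidate > best:
                    best, best_u = candidate, u
        score[v] = 1.0 + best
        back[v] = best_u
    return score, back


def trace(back, end):
    """Follow back-pointers from an end anchor to recover its chain."""
    chain = []
    while end is not None:
        chain.append(end)
        end = back[end]
    return chain[::-1]


anchors = [(10, 100), (12, 110), (11, 120), (500, 125)]
score, back = score_anchors(anchors, max_a=5, max_b=25, t_a=10, t_b=100)
best_end = max(score, key=score.get)
print(trace(back, best_end), round(score[best_end], 3))
```

The inner `max(0, ...)` shows up as `best` starting at 0.0: a predecessor is only used when it adds more than the gap penalty removes, so an isolated anchor such as the last one keeps a score of 1.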
SyMAP Algorithm Chains must satisfy constraints Number of anchors Strength of line  Pearson correlation coefficient Required to be more precisely linear the closer they are to the minimal number of anchors Exception for small and dense chains Lower correlation due to errors in the assignment of BES ends or clone ordering within a contig
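A chain filter along these lines might look like the sketch below. All numeric thresholds, the linear interpolation between them, and the function names are hypothetical choices made for illustration; the slides state only that shorter chains must be more precisely linear, and the exception for small, dense chains is not modeled.

```python
import math


def pearson(points):
    """Pearson correlation coefficient of (a, b) anchor coordinates."""
    n = len(points)
    mean_a = sum(a for a, _ in points) / n
    mean_b = sum(b for _, b in points) / n
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in points)
    s_aa = sum((a - mean_a) ** 2 for a, _ in points)
    s_bb = sum((b - mean_b) ** 2 for _, b in points)
    if s_aa == 0 or s_bb == 0:
        return 0.0  # degenerate chain: no linear trend can be measured
    return s_ab / math.sqrt(s_aa * s_bb)


def required_r(n, min_anchors=4, relax_at=12, strict_r=0.98, base_r=0.90):
    """Required |r|: strict for minimal chains, relaxed for long ones."""
    if n >= relax_at:
        return base_r
    frac = (n - min_anchors) / (relax_at - min_anchors)
    return strict_r - (strict_r - base_r) * frac


def accept_chain(points, min_anchors=4):
    """Apply the anchor-count and linearity constraints to one chain."""
    if len(points) < min_anchors:
        return False
    return abs(pearson(points)) >= required_r(len(points), min_anchors)


print(accept_chain([(1, 1), (2, 2), (3, 3), (4, 4)]))
print(accept_chain([(1, 1), (2, 3), (3, 2), (4, 4)]))
```

The absolute value of the coefficient is used so that inverted (negatively sloped) chains are judged by the same standard.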
Sytry Tool for testing synteny finding algorithms Allows for modifying the parameters of an algorithm and rerunning Results are shown as a dot plot Results need to be confirmed visually: "correct" is what looks right to the user
Automated Parameter Setting Difficult to set parameters (e.g.,  t A  and t B ) Effects of changes can be unclear Dependent on average distance between anchors and noise Optimal values vary between regions Have the algorithm set the gap parameters Attempt to optimize t x  for each chain
Sub-Chains Overall orientation of a synteny chain may not be accurate for sub-chains
Sub-Chain Finder Use only anchors that are part of a chain Define distance between anchors in terms of the number of anchors that fall between the anchors A significant gap signals the start of a possible inversion
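One way to read the rank-based distance is sketched below: walk the chain in order along the sequenced genome and, for each consecutive pair, count how many chain anchors lie between them along the FPC map. The threshold, the function names, and the orientation rule are assumptions made for illustration, and the sketch assumes the anchor tuples in a chain are distinct.

```python
def split_sub_chains(chain, min_gap=2):
    """Split a chain where consecutive anchors are far apart in rank.

    chain: list of distinct (a, b) anchors belonging to one chain.
    Distance between neighbors (in b order) is the number of chain
    anchors that fall between them in a order.
    """
    if not chain:
        return []
    by_b = sorted(chain, key=lambda p: p[1])
    a_rank = {p: r for r, p in enumerate(sorted(chain, key=lambda p: p[0]))}
    subs = [[by_b[0]]]
    for prev, cur in zip(by_b, by_b[1:]):
        between = abs(a_rank[cur] - a_rank[prev]) - 1
        if between >= min_gap:
            subs.append([])  # significant gap: possible inversion boundary
        subs[-1].append(cur)
    return subs


def orientation(sub_chain):
    """+1 if a increases with b across the sub-chain, -1 if it decreases."""
    if len(sub_chain) < 2:
        return 0
    return 1 if sub_chain[-1][0] > sub_chain[0][0] else -1


chain = [(0, 0), (1, 1), (2, 2), (6, 3), (5, 4), (4, 5), (3, 6), (7, 7), (8, 8)]
subs = split_sub_chains(chain)
print([len(s) for s in subs], [orientation(s) for s in subs])
```

Counting anchors instead of measuring coordinates keeps the gap test independent of how densely a region happens to be covered.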
Sub-Chains Evolutionary history e.g., total number of inversions Assigning an accurate orientation to all anchors in a chain Beneficial for fixing the clone end assignment of BES
BES Clone End Assignments BESs are arbitrarily assigned to clone ends Algorithm takes this into account However, the synteny when viewing can be distorted Orientation can be used to correct BES assignments
BES Clone End Assignments positive orientation -> lines should not cross [Figure: anchors numbered 1-8 drawn between the FPC map and the sequence, with one clone's ends labeled A B in one panel and B A in the other.]
BES Clone End Assignments negative orientation -> lines should cross [Figure: anchors numbered 1-8 drawn between the FPC map and the sequence, with one clone's ends labeled A B in one panel and B A in the other.]
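The crossing rule can be sketched for a single clone whose two BESs both hit the sequenced genome. This is an interpretation of the two slides above, not the SyMAP implementation; the function name and the representation of a clone as a pair of hit positions are assumptions.

```python
def fix_bes_ends(left_hit, right_hit, orientation):
    """Return (left_hit, right_hit) after correcting the end assignment.

    left_hit, right_hit: sequence positions hit by the BESs currently
    assigned to the clone's left and right ends on the FPC map.
    orientation: +1 for a positively oriented (sub-)chain, -1 for negative.

    With positive orientation the two lines should not cross, so the
    left end should hit the smaller position; with negative orientation
    they should cross. If the current assignment disagrees, swap it.
    """
    crosses = left_hit > right_hit
    should_cross = orientation < 0
    if crosses != should_cross:
        return right_hit, left_hit
    return left_hit, right_hit


print(fix_bes_ends(500, 100, +1))  # lines cross in a positive chain: swap
print(fix_bes_ends(100, 500, -1))  # lines fail to cross in a negative chain: swap
```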
SyMAP Views Accessible through a web browser Static views All synteny blocks ↔ sequenced chromosomes Synteny blocks ↔ sequenced chromosome Interactive views Dot plot view Genome to genome Chromosome to chromosome Alignment view FPC ↔ sequenced chromosome FPC ↔ FPC FPC ↔ sequenced chromosome ↔ FPC Close-up view FPC ↔ sequenced chromosome
All Blocks ↔ Sequenced Chromosomes
Blocks ↔ Sequenced Chromosome
Genome ↔ Genome Dot Plot
Chromosome ↔ Chromosome Dot Plot
Block ↔ Sequenced Chromosome
Subset Flipped
Contig ↔ Sequenced Chromosome
Filters and Controls
FPC ↔ Sequenced Chromosome ↔ FPC
FPC ↔ FPC
Close-up of Gene
SyMAP Implementation Caching is needed: Downloads large amounts of data from remote database History feature Navigating back and forth between the same views Soft References Remain alive as long as the memory is available Data objects Hold data in a compact form Converted to view objects when needed
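The caching idea can be sketched as below. Python has no direct counterpart to soft references, so this sketch substitutes a small least-recently-used cache with a fixed capacity; that substitution, the class name, and the loader callback are all assumptions made for illustration. The loader stands in for the expensive step of downloading data from the remote database.

```python
from collections import OrderedDict


class ViewCache:
    """Bounded cache: keeps the most recently used entries in memory."""

    def __init__(self, loader, capacity=32):
        self._loader = loader        # called on a cache miss
        self._capacity = capacity    # maximum number of cached entries
        self._items = OrderedDict()  # oldest entry first

    def get(self, key):
        if key in self._items:
            self._items.move_to_end(key)  # mark as recently used
            return self._items[key]
        value = self._loader(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict least recently used
        return value


loads = []


def load_view(name):
    """Hypothetical loader that records each simulated remote fetch."""
    loads.append(name)
    return name.upper()


cache = ViewCache(load_view, capacity=2)
for name in ["a", "b", "a", "c", "b"]:
    cache.get(name)
print(loads)
```

Navigating back to a recently shown view is served from memory, while a view that has been evicted is fetched again, which mirrors the back-and-forth history use case on the slide.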
Results www.agcol.arizona.edu/symap Maize and sorghum aligned to rice Maize FPC aligned to sorghum FPC Used in editing the maize FPC maps based on its alignment to rice  (Wei et al., in preparation)  Alignment of maize to rice chromosome 3 Buell et al. (2005)  Used in OMAP project  Aligning 12 species of rice to the sequenced genome of rice  (Wing et al., in preparation)
Acknowledgements Thesis Committee Dr. Cari Soderlund, thesis advisor Dr. Peter Downey Dr. Kobus Barnard This work is funded in part by NSF DBI #0115903 www.agcol.arizona.edu/symap

SyMAP Master's Thesis Presentation


Editor's Notes

  • #2 Hello. I am Austin Shoemaker and I’m going to be presenting SyMAP, a synteny mapping and analysis program.