
IGB genome genometry data models by Gregg Helt and Cyrus Harmon

These slides were developed by Gregg Helt and Cyrus Harmon to explain the core data models in Integrated Genome Browser (IGB). The goal was to make translation between protein, transcript, and genome coordinate systems easier and more powerful. These data models are what make IGB capable of correctly displaying probes that are split across intron boundaries. They also form the core of the ProtAnnot application, which displays protein domains mapped onto genomic sequence.

Published in: Science, Technology

Transcript of "IGB genome genometry data models by Gregg Helt and Cyrus Harmon"

  1. Genometry
     Gregg Helt, Cyrus Harmon
  2. Genometry
     •  Motivation and Purpose
     •  Points of Reference
     •  Genometry interfaces
     •  Genometry manipulations
     •  Genometry implementation
     •  Representation examples
     •  Prototype apps
     •  Current status, future work
  3. Motivation and Goals
     •  Desire for a more unified data model to represent relationships between biological sequences, such as:
        –  Annotations
        –  Alignments
        –  Sequence composition
     •  More networked, less hierarchical (genome-centric, transcript-centric)
     •  Simplicity
     •  Expressivity / Flexibility
     •  Memory and Computational Efficiency
     •  Use by others to provide core functionality for various Affy projects
  4. Points of Reference
     •  com.neomorphic.bio models
     •  Genisys DB and Genisys IDL
     •  EBI mapping models
     •  Apollo data models
     •  BioPerl
     •  BioJava
     •  Closest similarity to bio alignment models and Genisys alignment models
  5. Basic Annotations
     [Figure: Transcript T annotated on Genome G]
     Transcript T   G:1000..5000
        Exon E1     G:1000..1200
        Exon E2     G:3000..3500
        Exon E3     G:4500..5000
  6. Genometry Annotations – Specify All Coordinates
     [Figure: Transcript T annotated on Genome G, with coordinates on both sequences]
     Transcript T   G:1000..5000   T:0..1200
        Exon E1     G:1000..1200   T:0..200
        Exon E2     G:3000..3500   T:200..700
        Exon E3     G:4500..5000   T:700..1200
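The transcript coordinates on slide 6 follow from the genomic exon spans by laying the exons end to end. A minimal sketch of that arithmetic, assuming forward-strand exons given as half-open `(start, end)` pairs; the function name `exon_transcript_spans` is hypothetical and is not part of Genometry:

```python
def exon_transcript_spans(exons_on_genome):
    """Return each exon's span on the spliced transcript.

    Assumes exons are on the forward strand, sorted 5' to 3',
    and given as (start, end) pairs on the genome.
    """
    spans = []
    offset = 0  # running position along the transcript
    for start, end in exons_on_genome:
        length = end - start
        spans.append((offset, offset + length))
        offset += length
    return spans


# Exon coordinates on Genome G, taken from the slide.
exons_on_genome = [(1000, 1200), (3000, 3500), (4500, 5000)]
exons_on_transcript = exon_transcript_spans(exons_on_genome)
print(exons_on_transcript)
```

The three results correspond to the T: coordinates listed for E1, E2, and E3, and the last end point is the transcript length used for T:0..1200.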
  7. Genometry Annotations – All coordinates are relative to BioSeqs
     [Figure: annotations placed between the two BioSeqs, Genome G and Transcript T]
     TranscriptAnnot T1   G:1000..5000   T:0..1200
        ExonAnnot E1      G:1000..1200   T:0..200
        ExonAnnot E2      G:3000..3500   T:200..700
        ExonAnnot E3      G:4500..5000   T:700..1200
  8. Genometry Annotations – SeqSpans encapsulate a range along a BioSeq
     [Figure: same annotations, with each coordinate pair drawn as a SeqSpan on Genome G or Transcript T]
     TranscriptAnnot T1   G:1000..5000   T:0..1200
        ExonAnnot E1      G:1000..1200   T:0..200
        ExonAnnot E2      G:3000..3500   T:200..700
        ExonAnnot E3      G:4500..5000   T:700..1200
  9. Genometry Core Core
     •  BioSeq
        –  length, residues (optional)
     •  SeqSpan
        –  start, end, BioSeq
     •  SeqSymmetry
        –  SeqSpans (breadth)
        –  SeqSymmetry parent / child hierarchy (depth)
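The three core types on slide 9 can be sketched as plain data classes. This is an illustration of the described structure only, not the Genometry interfaces themselves: the field names, the `breadth`, `depth`, and `span_on` helpers, and the genome length of 10000 are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class BioSeq:
    name: str
    length: int
    residues: Optional[str] = None  # residues are optional


@dataclass(frozen=True)
class SeqSpan:
    start: int
    end: int
    seq: BioSeq


@dataclass
class SeqSymmetry:
    spans: List[SeqSpan] = field(default_factory=list)           # breadth
    children: List["SeqSymmetry"] = field(default_factory=list)  # depth

    def breadth(self) -> int:
        return len(self.spans)

    def depth(self) -> int:
        return 1 + max((child.depth() for child in self.children), default=0)

    def span_on(self, seq: BioSeq) -> Optional[SeqSpan]:
        """Return this symmetry's span on the given sequence, if any."""
        return next((s for s in self.spans if s.seq == seq), None)


genome = BioSeq("G", 10000)     # hypothetical genome length
transcript = BioSeq("T", 1200)


def pair(g_start, g_end, t_start, t_end, children=()):
    """Build a breadth-2 symmetry linking a genome span and a transcript span."""
    spans = [SeqSpan(g_start, g_end, genome), SeqSpan(t_start, t_end, transcript)]
    return SeqSymmetry(spans, list(children))


t1 = pair(1000, 5000, 0, 1200, children=[
    pair(1000, 1200, 0, 200),     # ExonAnnot E1
    pair(3000, 3500, 200, 700),   # ExonAnnot E2
    pair(4500, 5000, 700, 1200),  # ExonAnnot E3
])
print(t1.breadth(), t1.depth(), t1.span_on(transcript))
```

Note that neither `BioSeq` holds a pointer to the annotation; the symmetry alone ties the two sequences together, which is the property the next slide highlights.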
  10. Expressiveness of Core Core
     •  “Standard” annotations
     •  Singleton annotations
     •  Alternative Splicing
     •  Pairwise alignments
     •  Annotations with depth > 2
     •  Annotations with breadth > 2
     •  Indels
     •  Structure of analyzed sequence
     •  Fuzzy locations
     •  All without explicit pointers from BioSeq to annotation
  11. Genometry Modelling of Insertions and Deletions #1a
     [Figure: alignment of Transcript T to Genome G]
     Genome G:      …AGGCAATTAATTGATCCAGGTG……GAGTCCGAATAGGGTTAGCG…
     Transcript T:  GCAATTCAATTGATCCAG TCCGAATAGGTTAGCG
     G:1000..2017   T:0..34
        G:1000..1017   T:0..18    insertion in transcript relative to genome (deletion in genome relative to transcript)
           G:1000..1006   T:0..6
           G:1006..1017   T:7..18
        G:2000..2017   T:18..34   deletion in transcript relative to genome (insertion in genome relative to transcript)
           G:2000..2010   T:18..28
           G:2011..2017   T:28..34
  12. Genometry Modelling of Insertions and Deletions #1b
     [Figure: same alignment as #1a, with symbolic coordinates g0..g5 on Genome G and t0..t5 on Transcript T]
     G:g0..g5   T:t0..t5
        G:g0..g2   T:t0..t2   insertion in transcript relative to genome (deletion in genome relative to transcript)
           G:g0..g1   T:t0..t1
           G:g1..g2   T:t1+1..t2
        G:g3..g5   T:t3..t5   deletion in transcript relative to genome (insertion in genome relative to transcript)
           G:g3..g4     T:t3..t4
           G:g4+1..g5   T:t4..t5
  13. Genometry Modelling of Insertions and Deletions #2
     [Figure: as #1b, with an extra child symmetry pairing each indel with its residue]
     G:g0..g5   T:t0..t5
        G:g0..g2   T:t0..t2   insertion in transcript relative to genome (deletion in genome relative to transcript)
           G:g0..g1   T:t0..t1
           T:t1..t1+1   “C”:0..1
           G:g1..g2   T:t1+1..t2
        G:g3..g5   T:t3..t5   deletion in transcript relative to genome (insertion in genome relative to transcript)
           G:g3..g4     T:t3..t4
           G:g4..g4+1   “G”:0..1
           G:g4+1..g5   T:t4..t5
  14. Genometry Modelling of Insertions and Deletions #3
     [Figure: as #1b, with an extra child symmetry pairing each indel with a zero-length span on the other sequence]
     G:g0..g5   T:t0..t5
        G:g0..g2   T:t0..t2   insertion in transcript relative to genome (deletion in genome relative to transcript)
           G:g0..g1   T:t0..t1
           T:t1..t1+1   G:g1..g1
           G:g1..g2   T:t1+1..t2
        G:g3..g5   T:t3..t5   deletion in transcript relative to genome (insertion in genome relative to transcript)
           G:g3..g4     T:t3..t4
           G:g4..g4+1   T:t4..t4
           G:g4+1..g5   T:t4..t5
  15. Genometry Modelling of Insertions and Deletions #4
     [Figure: as #1b, with an extra child symmetry of breadth 3 combining approaches #2 and #3]
     G:g0..g5   T:t0..t5
        G:g0..g2   T:t0..t2   insertion in transcript relative to genome (deletion in genome relative to transcript)
           G:g0..g1   T:t0..t1
           T:t1..t1+1   G:g1..g1   “C”:0..1
           G:g1..g2   T:t1+1..t2
        G:g3..g5   T:t3..t5   deletion in transcript relative to genome (insertion in genome relative to transcript)
           G:g3..g4     T:t3..t4
           T:t4..t4   G:g4..g4+1   “G”:0..1
           G:g4+1..g5   T:t4..t5
  16. Modelling SNPs with Genometry: Two Approaches
     SeqA = reference chromosome:                                        …GGCAAGGAATGATC…
     SeqB = exactly same as reference chromosome, except for one SNP:    …GGCAAGTAATGATC…
     I. SNPs as annotations of differences between sequences
        I.a. annotation of just reference seq:             SeqA:x..x+1
        I.b. annotation of reference seq w/ variant base:  SeqA:x..x+1   “T”:0..1
        I.c. annotation of reference and variant seq:      SeqA:x..x+1   SeqB:x..x+1
     II. SNPs as gaps in similarity between two sequences
        SeqA:0..m   SeqB:0..n
           SeqA:0..x     SeqB:0..x
           SeqA:x+1..m   SeqB:x+1..n
  18. Sequence-oriented annotations
     •  AnnotatedBioSeq
        –  Contains a collection of SeqSymmetries that annotate the sequence
        –  Interfaces to retrieve annotations covered by a span within the sequence
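The retrieval interface on slide 18 can be illustrated with a simple overlap query. The sketch below is an assumption-laden stand-in: the class layout and the method names `add_annotation` and `annotations_overlapping` are hypothetical, annotations are reduced to labelled `(start, end)` pairs rather than full SeqSymmetries, and "covered by a span" is interpreted here as "overlapping the query span".

```python
class AnnotatedBioSeq:
    """A sequence that keeps a collection of annotations placed on it."""

    def __init__(self, name, length):
        self.name = name
        self.length = length
        self.annotations = []  # list of (label, start, end)

    def add_annotation(self, label, start, end):
        self.annotations.append((label, start, end))

    def annotations_overlapping(self, start, end):
        """Return labels of annotations that overlap the half-open span [start, end)."""
        return [label for label, s, e in self.annotations if s < end and start < e]


genome = AnnotatedBioSeq("G", 10000)  # hypothetical length
genome.add_annotation("E1", 1000, 1200)
genome.add_annotation("E2", 3000, 3500)
genome.add_annotation("E3", 4500, 5000)

hits = genome.annotations_overlapping(1100, 3200)
print(hits)
```

A linear scan is enough for a sketch; a real implementation with many annotations would more likely use a sorted or tree-based index.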
  19. Annotation Networks
     •  Can traverse networks of annotations, alternating between AnnotatedBioSeqs and SeqSymmetries
     [Figure: network of AnnotatedBioSeqs and SeqSymmetries]
     AnnotatedBioSeqs: Annotated GenomicSeq G, Annotated mRNASeq M, Annotated ProteinSeq P
     SeqSymmetries:
        domainOnProtein   proteinSpanA
        protein2mRNA      proteinSpanB   mrnaSpanB
        mRNA2genomic      genomicSpanC   mrnaSpanC
           m2gSub0   gSpanC0   mSpanC0
           m2gSub1   gSpanC1   mSpanC1
           m2gSub2   gSpanC2   mSpanC2
  20. Sequence Composition
     •  CompositeBioSeq
        –  Contains a SeqSymmetry describing the mapping of BioSeqs used in composition to the CompositeBioSeq itself
  21. Sequence Composition Representations
     •  Sequence Assembly / Golden Path / etc.
     •  Piecewise data loading / lazy data loading
     •  Genotypes
     •  Chromosomal Rearrangements
     •  Primer construction
     •  Reverse Complement
     •  Coordinate Shifting
  22. Genometry Modelling of Reverse Complement
     Sequence B = reverse complement of Sequence A
     BioSeq A             length: x
     Composite BioSeq B   length: x
     Sym AB (composition of B):   A:0..x   B:x..0
     SeqA:  AGGCAATTAATTGATCCAGGTGGAGTCCGAATAGGGTTAGCGA
     SeqB:  TCGCTAACCCTATTCGGACTCCACCTGGATCAATTAATTGCCT
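The composition A:0..x to B:x..0 on slide 22 implies a simple coordinate rule: a position p on A corresponds to position x - p on B. A minimal sketch of that rule, using the two sequences from the slide; the function names are hypothetical and the half-open span convention is an assumption:

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(residues):
    """Return the reverse complement of a DNA string."""
    return residues.translate(_COMPLEMENT)[::-1]


def map_span_to_reverse_complement(start, end, length):
    """Map a forward span on A to the equivalent forward span on B.

    With A:0..x composed onto B:x..0, position p on A lands at x - p on B,
    so the span end points swap roles.
    """
    return (length - end, length - start)


seq_a = "AGGCAATTAATTGATCCAGGTGGAGTCCGAATAGGGTTAGCGA"
seq_b = "TCGCTAACCCTATTCGGACTCCACCTGGATCAATTAATTGCCT"

b_start, b_end = map_span_to_reverse_complement(0, 6, len(seq_a))
print(reverse_complement(seq_a) == seq_b)
print(seq_a[0:6], seq_b[b_start:b_end])
```

The mapped slice of SeqB is the reverse complement of the original slice of SeqA, which is what lets annotations follow the sequence onto the opposite strand.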
  23. MultiSequence Alignments
     •  MultiSeqAlignment
        –  Alignments sliced “horizontally” -- each “row” in an alignment is a CompositeBioSeq whose composition maps another BioSeq to the same coord space as the alignment
     •  Can also slice vertically (synteny)
  24. Alignment Representations
     •  Can represent same alignment as either MultiSeqAlignment or Synteny
     •  Transformation from horizontal slicing (MultiSeqAlignment) to vertical slicing (Synteny)
  25. Complete Genometry Core Models
     •  Mutability
     •  Curations
  26. Genometry Manipulations
     •  Symmetry Intersection (AND)
     •  Symmetry Union (OR)
     •  Symmetry Inverse (NOT)
     •  Symmetry Mutual Exclusion (XOR)
     •  Symmetry Transformation / Mapping
  27. Symmetry Combination Operations
     [Figure: spans of SymA and SymB, with the results of XOR(A, B), AND(A, B), OR(A, B), NOT(A), and NOT(B)]
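The combination operations on slides 26 and 27 behave like set operations on intervals. The sketch below works on one sequence at a time and reduces each symmetry to a list of half-open `(start, end)` spans. All function names are hypothetical, and NOT is taken relative to an explicit bounding range, which is an assumption the slides do not confirm.

```python
def normalize(spans):
    """Sort spans and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union(a, b):  # OR
    return normalize(a + b)


def intersection(a, b):  # AND
    result = []
    for s1, e1 in normalize(a):
        for s2, e2 in normalize(b):
            start, end = max(s1, s2), min(e1, e2)
            if start < end:
                result.append((start, end))
    return result


def inverse(a, lo, hi):  # NOT, within the bounding range [lo, hi)
    result, cursor = [], lo
    for start, end in normalize(a):
        if start > cursor:
            result.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < hi:
        result.append((cursor, hi))
    return result


def xor(a, b):  # covered by exactly one of a, b
    if not a and not b:
        return []
    lo = min(s for s, _ in a + b)
    hi = max(e for _, e in a + b)
    return intersection(union(a, b), inverse(intersection(a, b), lo, hi))


sym_a = [(0, 10), (20, 30)]
sym_b = [(5, 25)]
print(union(sym_a, sym_b))
print(intersection(sym_a, sym_b))
print(inverse(sym_a, 0, 40))
print(xor(sym_a, sym_b))
```

The nested loop in `intersection` is quadratic; it keeps the example short, and a sweep over two sorted lists would do the same work in linear time.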
  28. Genometry Transformations
     •  Every symmetry of breadth > 1 describes a mapping between different sequences
     •  Therefore every symmetry can be used to transform coordinates of other symmetries from one sequence to another
     •  Because sequence annotations, alignments, and composition are all based on symmetries, can use any of them as mappings
     •  Discontiguous linear mapping algorithm
     •  Results of transformation are also symmetries
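A discontiguous linear mapping can be pictured as a set of aligned blocks, each of which shifts coordinates by its own offset. The sketch below is one simple way to apply such a mapping and is not presented as the algorithm used in Genometry: the function name `transform_span` and the tuple layout are hypothetical, and it assumes forward-strand blocks of equal length on both sequences.

```python
def transform_span(span, mapping):
    """Map a half-open span from the source sequence to the destination.

    mapping is a list of (src_start, src_end, dst_start, dst_end) blocks,
    one per child of the mapping symmetry. The result has one piece per
    block the span overlaps.
    """
    start, end = span
    pieces = []
    for src_start, src_end, dst_start, _dst_end in mapping:
        lo, hi = max(start, src_start), min(end, src_end)
        if lo < hi:  # the span overlaps this block
            offset = dst_start - src_start
            pieces.append((lo + offset, hi + offset))
    return pieces


# Transcript-to-genome blocks, using the exon coordinates from the earlier slides.
mrna2genomic = [
    (0, 200, 1000, 1200),
    (200, 700, 3000, 3500),
    (700, 1200, 4500, 5000),
]

on_genome = transform_span((100, 400), mrna2genomic)
print(on_genome)
```

A contiguous span on the transcript comes back as two pieces on the genome because it crosses an exon boundary, so the result is itself a parent with children, in line with the point that results of transformation are also symmetries.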
  29. Coordinate Mapping
     Example – mapping domain from protein coords to genomic coords
     (note that domain mapped to spliced transcript only overlaps two of the three exons, hence only end up with two children for resulting domain2genomic symmetry)
     [Figure: network of AnnotatedBioSeqs (BioSeq) and SeqSymmetries (SeqAnnot), plus the “growing” domain2genomic result, a MutableSeqSymmetry]
     AnnotatedBioSeqs: Annotated GenomicSeq G, Annotated mRNASeq M, Annotated ProteinSeq P
     SeqSymmetries:
        domainOnProtein   proteinSpanA
        protein2mRNA      proteinSpanB   mrnaSpanB
        mRNA2genomic      genomicSpanC   mrnaSpanC
           m2gSub0   gSpanC0   mSpanC0
           m2gSub1   gSpanC1   mSpanC1
           m2gSub2   gSpanC2   mSpanC2
     “Growing” domain2genomic result:
        initially:                       domain2genomic   proteinSpanA
        transform via protein2mRNA:      domain2genomic   proteinSpanA   mrnaSpanA
        transform via mRNA2genomic:      domain2genomic   proteinSpanA   mrnaSpanA   genomicSpanA
           d2gSub0   pSpanA0   mSpanA0   gSpanA0
           d2gSub1   pSpanA1   mSpanA1   gSpanA1
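The two-step mapping on slide 29 can be sketched by chaining two block mappings, the first of which scales coordinates by three (one amino acid per codon). Everything numeric here is hypothetical: the slide gives no coordinates, so the 300-residue protein, the coding region at mRNA 150..1050, and the domain at protein 10..100 are invented for illustration, as is the function name `transform`.

```python
def transform(spans, mapping):
    """Map half-open spans through (src_start, src_end, dst_start, dst_end) blocks.

    Each block is treated as a linear map, so blocks may scale as well as
    shift coordinates. Assumes forward-strand blocks.
    """
    out = []
    for start, end in spans:
        for src_start, src_end, dst_start, dst_end in mapping:
            lo, hi = max(start, src_start), min(end, src_end)
            if lo < hi:
                src_len = src_end - src_start
                dst_len = dst_end - dst_start
                out.append((dst_start + (lo - src_start) * dst_len // src_len,
                            dst_start + (hi - src_start) * dst_len // src_len))
    return out


# Hypothetical: protein residues 0..300 are encoded by mRNA bases 150..1050.
protein2mrna = [(0, 300, 150, 1050)]
# Exon blocks from the earlier annotation slides.
mrna2genomic = [
    (0, 200, 1000, 1200),
    (200, 700, 3000, 3500),
    (700, 1200, 4500, 5000),
]

domain_on_protein = [(10, 100)]  # hypothetical domain location
domain_on_mrna = transform(domain_on_protein, protein2mrna)
domain_on_genome = transform(domain_on_mrna, mrna2genomic)
print(domain_on_mrna, domain_on_genome)
```

With these numbers the domain touches only the first two exons, so the genomic result has two pieces, mirroring the slide's note about two children in the resulting domain2genomic symmetry.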
  30. Details of “split” mapping
     [Figure: domain2genomic transformed via mRNA2genomic, one child of the mapping at a time]
     Mapping symmetry:
        mRNA2genomic   genomicSpanC   mrnaSpanC
           m2gSub0   gSpanC0   mSpanC0
           m2gSub1   gSpanC1   mSpanC1
           m2gSub2   gSpanC2   mSpanC2
     Starting point:   domain2genomic   proteinSpanA   mrnaSpanA
     Step 1 (loop) [a, b, c]:
        Step 1a “sit still”:     d2gSub0   mSpanA0
        Step 1b “roll back”:     d2gSub0   mSpanA0   pSpanA0
        Step 1c “roll forward”:  d2gSub0   mSpanA0   pSpanA0   gSpanA0
        Second pass of the loop: d2gSub1   mSpanA1   pSpanA1   gSpanA1
     Step 2 “roll up”:
        domain2genomic   proteinSpanA   mrnaSpanA   genomicSpanA
           d2gSub0   pSpanA0   mSpanA0   gSpanA0
           d2gSub1   pSpanA1   mSpanA1   gSpanA1
  31. Transformations Applications
     •  Mapping Affy probes to genome
     •  Mapping contig annotations to larger genomic assemblies
     •  Mapping protein annotations to genome
     •  Mapping genomic annotations to proteins and transcripts (SNPs, for example)
     •  Sequence slice-and-dice with annotation propagation
     •  Propagation of annotations across versioned sequences (such as Golden Path)
     •  Deep mappings (for example, SNP to genomeA to transcriptB to proteinC to homolog proteinD to transcriptE to genomeF to putative SNP location in genomeF – symmetry path of depth 5)
     •  Etc., etc.
  32. Prototypes & Applications
     •  GenometryTest
     •  Generic Genometry Viewer
     •  ProtAnnot (Ann)
     •  GPView (Cyrus)
     •  AlignView (Eric)
     •  ContigViewer (Peter, Barry)
     •  Unibrow (Transcriptome Group)
  33. Genometry Summary
     •  Genometry presents a unified model for location-based sequence relationships
     •  Sequence annotation, composition, and alignment are all based on SeqSymmetry
     •  Provides powerful genometry manipulations -- any SeqSymmetry can be used to map other SeqSymmetries across sequences / coordinate spaces
     •  Work in progress