Unison: An Integrated Platform for
Computational Biology Discovery
Freely accessible and available at http://unison-db.org/ .

Reece Hart, Kiran Mukhyala
Genentech, Inc.

Pacific Symposium on Biocomputing 2009
assert(Sequence Analysis != Sequence Mining)

➢   sequences: non-redundant superset of all sequences
➢   feature types/models: HMM, TM, signal, etc.
➢   parameters: execution arguments/options for every prediction type and result
➢   prediction results: method-specific data such as score, e-value, p-value, kinase probability, etc.

Sequence Analysis
    i.e., show predictions for a given sequence.
    Typically involves minutes to hours of computing per sequence.

Feature-Based Mining
    i.e., show sequences that contain specified features.
    Typically entails days to months of computing results.
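
The difference between analysis and mining can be sketched as two lookups over one table of precomputed results: analysis reads by sequence, mining reads by feature. The schema below (tables `params` and `feature`, and every value in them) is a hypothetical stand-in for illustration, not Unison's actual schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory toy database (hypothetical schema, not Unison's)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE params (params_id INTEGER PRIMARY KEY, name TEXT, cmdline TEXT);
        CREATE TABLE feature (
            pseq_id INTEGER, model TEXT, params_id INTEGER,
            start INTEGER, stop INTEGER, score REAL
        );
        -- One index per access pattern: by sequence and by model.
        CREATE INDEX feature_by_seq ON feature (pseq_id);
        CREATE INDEX feature_by_model ON feature (model);
    """)
    con.execute("INSERT INTO params VALUES (1, 'demo-params', '--cutoff 20')")
    con.executemany(
        "INSERT INTO feature VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "ig", 1, 42, 102, 30.0),
            (1, "TM", 1, 243, 265, None),
            (2, "ig", 1, 10, 70, 25.0),
            (3, "EGF", 1, 108, 143, 41.0),
        ],
    )
    return con


def analyze(con, pseq_id):
    """Sequence analysis: all predictions for one given sequence."""
    rows = con.execute(
        "SELECT model, start, stop FROM feature WHERE pseq_id = ? ORDER BY start",
        (pseq_id,),
    )
    return rows.fetchall()


def mine(con, model):
    """Feature-based mining: all sequences that carry a given feature."""
    rows = con.execute(
        "SELECT DISTINCT pseq_id FROM feature WHERE model = ? ORDER BY pseq_id",
        (model,),
    )
    return [r[0] for r in rows]
```

With predictions stored once, `analyze(con, 1)` and `mine(con, "ig")` are both cheap reads; without that store, the mining question requires running every predictor over every sequence first.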
Unison in a Nutshell




Protein Sequences and Annotations (center), surrounded by:
➢   Domain, Structure & Homology Predictions
➢   Structures & Ligands
➢   Genomes, Gene Mapping & Structure, Probes
➢   Auxiliary Annotations: GO, RIF, SCOP, etc.

Sequences and Annotations
    UniProt, IPI, Ensembl, RefSeq, PDB, STRING, PHANTOM, HUGE, ROUGE, MGC, Derwent, pataa, nr, etc.
    >13M seqs, >17k species, 69 origins
Auxiliary Data
    HomoloGene, Gene Ontology, taxonomy, PDB, HUGO, SCOP, etc.
Precomputed predictions
    Domains, homology, structure, TMs, localization, signals, disorder, etc.
    >200M predictions, 23 types, ~6 CPU-years
Unison has many applications.
➢   Unison Web Tools
➢   Other In-House Tools
➢   Ad Hoc Mining: mining and analysis projects




Unison Web Tools
Unison is a platform for diverse tools.




                                    Matt Brauer
                                    Guy Cavet
                                    Josh Kaminker
                                    Scott Lohr
                                    Kathryn Woods
                                    Jean Yuan
                                    Peng Yue
Unison facilitates complex mining.




Mining for TNF ligands
Mining for E3 Ligases
Mining for 4H Cytokines
Mining for ITxM
Mining for deubiquitinases
Analyzing SNP impact on binding interfaces




                                             Jason Hackney
                                             Nandini Krishnamurthy
                                             Li Li
                                             Yun Li
                                             Jinfeng Liu
                                             Shiuh-ming Loh
                                             Kiran Mukhyala
Mining for ITIMs the old way.

                 Ig      TM       ITIM


➢   Collect sequences.
➢   Prune redundant sequences. (How?!)
➢   For each unique sequence, predict
    ●   Immunoglobulin domains.
    ●   Transmembrane domains.
    ●   ITIM domains.
➢   Write a program that filters predictions.
➢   Summarize hits with external data.
➢   Do it again when source data are updated.
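
A minimal sketch of two of the manual steps above, pruning redundant sequences and filtering predictions, shows how much glue code the old way needs. Exact-match deduplication by hash is only one possible answer to "How?!". The regular expression is the commonly cited ITIM consensus (S/I/V/L)xYxx(I/V/L), an assumption that may differ from the MOD_TYR_ITIM pattern. All function names are hypothetical.

```python
import hashlib
import re

# Commonly cited ITIM consensus (S/I/V/L)xYxx(I/V/L). This is an assumption
# for illustration and may differ from the pattern behind MOD_TYR_ITIM.
ITIM_RE = re.compile(r"[SIVL].Y..[IVL]")


def prune_redundant(records):
    """Keep the first (name, sequence) record for each distinct sequence.

    Redundancy here means exact identity after upper-casing; near-identical
    sequences are NOT merged.
    """
    seen = {}
    for name, seq in records:
        seq = seq.upper()
        key = hashlib.sha256(seq.encode("ascii")).hexdigest()
        seen.setdefault(key, (name, seq))
    return list(seen.values())


def find_itims(seq):
    """Return 1-based inclusive (start, stop) for each non-overlapping match."""
    return [(m.start() + 1, m.end()) for m in ITIM_RE.finditer(seq)]


def has_ig_tm_itim(ig, tm, itim):
    """True if some Ig hit ends before a TM hit that ends before an ITIM hit.

    Each argument is a list of (start, stop) tuples for one sequence.
    """
    return any(
        i[1] < t[0] and t[1] < m[0]
        for i in ig
        for t in tm
        for m in itim
    )
```

Even with these helpers, the Ig and TM predictions still have to be computed, parsed, and kept in sync with each source update, which is the burden the list above describes.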
Mining for ITIMs the Unison way.

                             Ig                   TM             ITIM
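-- Sequences with an Ig domain, then a TM segment, then an ITIM (N- to C-terminal).
SELECT IG.pseq_id,
       IG.start AS ig_start, IG.stop AS ig_stop, IG.score, IG.eval,
       TM.start AS tm_start, TM.stop AS tm_stop,
       ITIM.start AS itim_start, ITIM.stop AS itim_stop
  FROM pahmm_current_pfam_v IG
  -- The TM segment must begin after the Ig domain ends.
  JOIN pftmhmm_tms_v TM ON IG.pseq_id = TM.pseq_id AND IG.stop < TM.start
  -- The ITIM must begin after the TM segment ends.
  JOIN pfregexp_v ITIM ON TM.pseq_id = ITIM.pseq_id AND TM.stop < ITIM.start
 WHERE IG.name = 'ig' AND IG.eval < 1e-2
   AND ITIM.acc = 'MOD_TYR_ITIM';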

  pseq_id  ig_start  ig_stop  score      eval  tm_start  tm_stop  itim_start  itim_stop  best_annotation
      234       262      316     30  7.40E-06       440      462         518        523  UniProtKB/Swiss-Prot:SIGL5_HUMAN (RecName: Fu
      254       158      213     36  1.90E-07       284      306         386        391  UniProtKB/Swiss-Prot:VSIG4_HUMAN (RecName: F
      544       157      215     24  6.60E-04       348      370         431        436  UniProtKB/Swiss-Prot:SIGL9_HUMAN (RecName: Fu
      797       254      312     40  7.60E-09      1099     1121        1361       1366  UniProtKB/Swiss-Prot:DCC_HUMAN (RecName: Ful
     1113        42      102     30  1.20E-05       243      265         300        305  UniProtKB/Swiss-Prot:KI2L2_HUMAN (RecName: Fu
     1114        42      102     30  6.50E-06       243      265         330        335  UniProtKB/Swiss-Prot:KI2L1_HUMAN (RecName: Fu
     1115        42      102     31  4.20E-06       243      265         301        306  UniProtKB/Swiss-Prot:KI2L3_HUMAN (RecName: Fu
     1116        42       97     30  1.10E-05       339      361         396        401  UniProtKB/TrEMBL:Q95368_HUMAN (SubName: Fu
     1134       340      388     26  1.40E-04       603      625         688        693  UniProtKB/Swiss-Prot:PECA1_HUMAN (RecName: F
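
To see how the join conditions enforce the Ig, TM, ITIM order, the query can be exercised against a toy in-memory SQLite database. This is only a sketch: the three view names are taken from the query, but here they are plain tables with assumed columns, the real Unison views live in PostgreSQL, and pseq_ids 9001 and 9002 are invented decoys. The pseq_id 234 row reuses the coordinates from the results listed above.

```python
import sqlite3

QUERY = """
SELECT IG.pseq_id,
       IG.start AS ig_start, IG.stop AS ig_stop, IG.score, IG.eval,
       TM.start AS tm_start, TM.stop AS tm_stop,
       ITIM.start AS itim_start, ITIM.stop AS itim_stop
  FROM pahmm_current_pfam_v IG
  JOIN pftmhmm_tms_v TM ON IG.pseq_id = TM.pseq_id AND IG.stop < TM.start
  JOIN pfregexp_v ITIM ON TM.pseq_id = ITIM.pseq_id AND TM.stop < ITIM.start
 WHERE IG.name = 'ig' AND IG.eval < 1e-2
   AND ITIM.acc = 'MOD_TYR_ITIM'
 ORDER BY IG.pseq_id
"""


def run_demo():
    """Load toy rows into stand-in tables and run the mining query."""
    con = sqlite3.connect(":memory:")
    # Stand-in tables named after the views in the query; columns are assumed.
    con.executescript("""
        CREATE TABLE pahmm_current_pfam_v
            (pseq_id INTEGER, name TEXT, start INTEGER, stop INTEGER,
             score INTEGER, eval REAL);
        CREATE TABLE pftmhmm_tms_v (pseq_id INTEGER, start INTEGER, stop INTEGER);
        CREATE TABLE pfregexp_v
            (pseq_id INTEGER, acc TEXT, start INTEGER, stop INTEGER);
    """)
    con.executemany(
        "INSERT INTO pahmm_current_pfam_v VALUES (?, ?, ?, ?, ?, ?)",
        [
            (234, "ig", 262, 316, 30, 7.4e-06),   # coordinates from the results
            (9001, "ig", 50, 110, 28, 1.0e-05),   # decoy: ITIM precedes the TM
            (9002, "ig", 50, 110, 5, 0.5),        # decoy: e-value too weak
        ],
    )
    con.executemany(
        "INSERT INTO pftmhmm_tms_v VALUES (?, ?, ?)",
        [(234, 440, 462), (9001, 300, 322), (9002, 300, 322)],
    )
    con.executemany(
        "INSERT INTO pfregexp_v VALUES (?, ?, ?, ?)",
        [
            (234, "MOD_TYR_ITIM", 518, 523),
            (9001, "MOD_TYR_ITIM", 200, 205),
            (9002, "MOD_TYR_ITIM", 400, 405),
        ],
    )
    return con.execute(QUERY).fetchall()
```

Only pseq_id 234 survives: 9001 fails `TM.stop < ITIM.start` and 9002 fails `IG.eval < 1e-2`.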
“Are you sure about this Stan? It seems odd that a
pointy head and a long beak is what makes them fly.”
                              J. Workman, Science 245:1399 (1989)
Kiran Mukhyala

Fernando Bazan, Matt Brauer, Jason Hackney, Pete Haverty,
Ken Jung, Josh Kaminker, Nandini Krishnamurthy, Li Li, Yun Li,
Shiuh-ming Loh, Jinfeng Liu, Peng Yue, Jianjun Zhang, Yan Zhang

http://unison-db.org/
Open access web site, downloads, documentation, references

unison-db.org:5432
PostgreSQL & ODBC/JDBC/SDBC access
Unison Contents
  sequences: >Unison:98 MSTESMIRDVE...FGIIAL; >Unison:23782 VRSSSRTPSD...FGIIAL
  aliases: TNFA_HUMAN, Q1XHZ6, IPI00001671.1, INCY:1109711.FL1p, CCDS4702.1, gi:25952111
  protein features:
        1   |    23   |         | SS
      108   |   143   | 1.8e-06 | EGF
      162   |   184   |         | TM
      133   |   138   |         | ITIM
  patents: Geneseq:AAP60074; 1991-10-29; SUNTORY; EP205038-A; New tumour...
  HUGO: TNFSF9, TNFSF10, TNFSF11
  homologs:
      NP_000585.2 NP_036807.1 | RAT
      NP_000585.2 NP_038721.1 | MOUSE
      NP_000585.2 XP_858423.1 | CANFA
  GO: Function; transcription (initiation, elongation)
  SNPs: P84L, A94T
  Entrez: gene_id, symbol, locus
  taxonomy: 9606 Homo sapiens; 10090 Mus musculus; 10028 Rattus rattus
  alignments: TNFA 1tnfA; TNFA 1tnfB; ...; TNFA 5tswF
  aa-to-resid: MSTESMIR, DVEFGIIA, TESMIRDV, IIAMDAC
  loci: 1 233 6+:31651498-31653288
  structures: 1tnf, 1a8m, 2tun, 4tsv, 5tsw
  SCOP: all alpha; all beta (Ig, TNF-like); alpha+beta
  genomes: Hs35, Hs36, RAT
  probes: HGU133P, WHG
Ex1: Mine for sequences w/conserved features.
Ex2: Locate SNPs and Domains on Structure
Unison can also help you...
➢   Answer more sophisticated questions.
    ●   Require orthologs or a specified exon structure.
➢   Annotate hits.
    ●   Annotate with locus, probes, HUGO gene name,
        structures, PubMed refs, external links.
    ●   Group splice forms by locus.
➢   Explore alternatives.
    ●   How do parameters influence results?
    ●   Try other prediction algorithms.
➢   Stay current.
    ●   When new data are available, just rerun the query.
➢   Move on.
    ●   The same data are available to other projects and
        other people.
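
As one illustration of the annotation step, "Group splice forms by locus" can be sketched as merging hits whose genomic intervals overlap on the same chromosome and strand. This is an assumed approach rather than Unison's own method. The function name and the tuple layout are hypothetical. Only the 6+:31651498-31653288 interval comes from the slides, and pairing it with pseq_id 98 is an assumption for the example.

```python
def group_by_locus(hits):
    """Group pseq_ids whose loci overlap on the same chromosome and strand.

    hits: iterable of (pseq_id, chrom, strand, start, stop) tuples.
    Returns a list of groups, each a list of pseq_ids, ordered by
    chromosome, strand, and start position.
    """
    groups = []
    current_key = None
    current_end = None
    ordered = sorted(hits, key=lambda h: (h[1], h[2], h[3], h[4]))
    for pseq_id, chrom, strand, start, stop in ordered:
        if current_key == (chrom, strand) and start <= current_end:
            # Overlaps the running locus: same group, extend its right edge.
            groups[-1].append(pseq_id)
            current_end = max(current_end, stop)
        else:
            # New chromosome, new strand, or a gap: start a new group.
            groups.append([pseq_id])
            current_key = (chrom, strand)
            current_end = stop
    return groups


# Illustrative input; every row except the first interval is made up.
demo_hits = [
    (98, "6", "+", 31651498, 31653288),
    (23782, "6", "+", 31651600, 31653288),
    (500, "6", "-", 31651498, 31653288),
    (77, "1", "+", 100, 200),
]
```

Here `group_by_locus(demo_hits)` places 98 and 23782 in one group and keeps the opposite-strand and other-chromosome hits separate.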
