Unison: An Integrated Platform for Computational Biology Discovery
1. Unison: An Integrated Platform for Computational Biology Discovery
Freely accessible and available at http://unison-db.org/ .
Reece Hart, Kiran Mukhyala
Genentech, Inc.
Pacific Symposium on Biocomputing 2009
2. assert(Sequence Analysis != Sequence Mining)
➢ Sequence Analysis
● i.e., show predictions for a given sequence.
● Typically involves minutes to hours of computing per sequence.
➢ Feature-Based Mining
● i.e., show sequences that contain specified features.
● Typically entails days to months of computing results.
➢ [Diagram: the data both approaches draw on]
● sequences: non-redundant superset of all sequences
● feature types/models: HMM, TM, signal, etc.
● parameters: execution arguments/options for every prediction type and result
● prediction results: method-specific data such as score, e-value, p-value, kinase probability, etc.
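The difference between the two access patterns can be sketched on a toy database. This is only an illustration: the table and column names (seq, model, params, feature) and all rows are hypothetical and far simpler than Unison's real schema. Once predictions are stored, analysis is a lookup by sequence and mining is a lookup by feature over the same rows.

```python
import sqlite3

# Toy schema: a hypothetical, heavily simplified stand-in for Unison's tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE seq     (seq_id INTEGER PRIMARY KEY, residues TEXT UNIQUE);
CREATE TABLE model   (model_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE params  (params_id INTEGER PRIMARY KEY, cmdline TEXT);
CREATE TABLE feature (seq_id INTEGER, model_id INTEGER, params_id INTEGER,
                      start INTEGER, stop INTEGER, score REAL);
""")
conn.executemany("INSERT INTO seq VALUES (?, ?)",
                 [(1, "AAAA"), (2, "CCCC"), (3, "DDDD")])  # made-up residues
conn.executemany("INSERT INTO model VALUES (?, ?)",
                 [(1, "ig"), (2, "TM"), (3, "signal")])
conn.execute("INSERT INTO params VALUES (1, 'hypothetical --default-options')")
conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 5, 60, 30.0),    # seq 1: ig domain (made-up coordinates)
    (1, 2, 1, 80, 102, None),  # seq 1: TM helix
    (2, 2, 1, 10, 32, None),   # seq 2: TM helix
    (3, 3, 1, 1, 23, None),    # seq 3: signal peptide
])

def predictions_for(conn, seq_id):
    """Sequence analysis: all stored predictions for one sequence."""
    rows = conn.execute(
        "SELECT m.name, f.start, f.stop FROM feature f "
        "JOIN model m ON m.model_id = f.model_id "
        "WHERE f.seq_id = ? ORDER BY f.start", (seq_id,))
    return rows.fetchall()

def sequences_with(conn, model_name):
    """Feature-based mining: all sequences that carry a given feature."""
    rows = conn.execute(
        "SELECT DISTINCT f.seq_id FROM feature f "
        "JOIN model m ON m.model_id = f.model_id "
        "WHERE m.name = ? ORDER BY f.seq_id", (model_name,))
    return [r[0] for r in rows]

print(predictions_for(conn, 1))
print(sequences_with(conn, "TM"))
```

Both functions read the same precomputed rows, so the expensive prediction step is paid once rather than per question.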
3. Unison in a Nutshell
[Diagram: the data domains Unison integrates]
● Protein Sequences and Annotations
● Domain, Structure & Homology Predictions
● Structures & Ligands
● Genomes, Gene Structure, Probes
● Auxiliary Mapping & Annotations (GO, RIF, SCOP, etc.)
➢ Sequences and Annotations
● UniProt, IPI, Ensembl, RefSeq, PDB, STRING, PHANTOM, HUGE, ROUGE, MGC, Derwent, pataa, nr, etc.
● >13M seqs, >17k species, 69 origins
➢ Auxiliary Data
● HomoloGene, Gene Ontology, taxonomy, PDB, HUGO, SCOP, etc.
➢ Precomputed predictions
● Domains, homology, structure, TMs, localization, signals, disorder, etc.
● >200M predictions, 23 types, ~6 CPU-years
4. Unison has many applications.
➢ Unison Web Tools
➢ Other In-House Tools
➢ Ad Hoc Mining
➢ Mining and analysis projects
6. Unison is a platform for diverse tools.
Matt Brauer
Guy Cavet
Josh Kaminker
Scott Lohr
Kathryn Woods
Jean Yuan
Peng Yue
7. Unison facilitates complex mining.
Mining for TNF ligands
Mining for E3 Ligases
Mining for 4H Cytokines
Mining for ITxM
Mining for deubiquitinases
Analyzing SNP impact on binding interfaces
Jason Hackney
Nandini Krishnamurthy
Li Li
Yun Li
Jinfeng Liu
Shiuh-ming Loh
Kiran Mukhyala
8. Mining for ITIMs the old way.
Ig TM ITIM
➢ Collect sequences.
➢ Prune redundant sequences. (How?!)
➢ For each unique sequence, predict
● Immunoglobulin domains.
● Transmembrane domains.
● ITIM domains.
➢ Write a program that filters predictions.
➢ Summarize hits with external data.
➢ Do it again when source data are updated.
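The "(How?!)" in the pruning step is the crux of the old way. One common approach, shown here as a minimal sketch and not necessarily how Unison does it, is to key each sequence by a digest of its normalized residues, so that identical sequences from different sources collapse to one record while every source alias is kept. The function name, the aliases, and the residue strings below are all made up for illustration.

```python
import hashlib

def prune_redundant(records):
    """Collapse (alias, sequence) records that share an identical sequence.

    Returns a dict keyed by the SHA-256 digest of the normalized sequence;
    each value holds the sequence and every alias that pointed at it.
    """
    unique = {}
    for alias, seq in records:
        # Normalize: drop whitespace/newlines and ignore case.
        normalized = "".join(seq.split()).upper()
        key = hashlib.sha256(normalized.encode("ascii")).hexdigest()
        entry = unique.setdefault(key, {"seq": normalized, "aliases": []})
        entry["aliases"].append(alias)
    return unique

# Hypothetical input: the same protein reported by two sources.
records = [
    ("sourceA:P1", "MSTESMIR"),
    ("sourceB:X9", "mstes\nmir"),   # same residues, different formatting
    ("sourceA:P2", "VRSSSRTP"),
]
pruned = prune_redundant(records)
for entry in pruned.values():
    print(entry["seq"], entry["aliases"])
```

This only removes exact duplicates; near-identical sequences (fragments, variants) still need a separate policy.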
9. Mining for ITIMs the Unison way.
Ig TM ITIM
-- Proteins with an Ig domain, then a TM helix, then an ITIM motif (N- to C-terminal).
SELECT IG.pseq_id,
       IG.start   AS ig_start,   IG.stop   AS ig_stop, IG.score, IG.eval,
       TM.start   AS tm_start,   TM.stop   AS tm_stop,
       ITIM.start AS itim_start, ITIM.stop AS itim_stop
  FROM pahmm_current_pfam_v IG
  JOIN pftmhmm_tms_v TM   ON IG.pseq_id = TM.pseq_id   AND IG.stop < TM.start    -- TM after Ig
  JOIN pfregexp_v    ITIM ON TM.pseq_id = ITIM.pseq_id AND TM.stop < ITIM.start  -- ITIM after TM
 WHERE IG.name = 'ig' AND IG.eval < 1e-2
   AND ITIM.acc = 'MOD_TYR_ITIM';
pseq_id  ig_start  ig_stop  score  eval      tm_start  tm_stop  itim_start  itim_stop  best_annotation
234      262       316      30     7.40E-06  440       462      518         523        UniProtKB/Swiss-Prot:SIGL5_HUMAN (RecName: Fu
254      158       213      36     1.90E-07  284       306      386         391        UniProtKB/Swiss-Prot:VSIG4_HUMAN (RecName: F
544      157       215      24     6.60E-04  348       370      431         436        UniProtKB/Swiss-Prot:SIGL9_HUMAN (RecName: Fu
797      254       312      40     7.60E-09  1099      1121     1361        1366       UniProtKB/Swiss-Prot:DCC_HUMAN (RecName: Ful
1113     42        102      30     1.20E-05  243       265      300         305        UniProtKB/Swiss-Prot:KI2L2_HUMAN (RecName: Fu
1114     42        102      30     6.50E-06  243       265      330         335        UniProtKB/Swiss-Prot:KI2L1_HUMAN (RecName: Fu
1115     42        102      31     4.20E-06  243       265      301         306        UniProtKB/Swiss-Prot:KI2L3_HUMAN (RecName: Fu
1116     42        97       30     1.10E-05  339       361      396         401        UniProtKB/TrEMBL:Q95368_HUMAN (SubName: Fu
1134     340       388      26     1.40E-04  603       625      688         693        UniProtKB/Swiss-Prot:PECA1_HUMAN (RecName: F
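The ordering logic of the query (Ig ends before TM starts, TM ends before ITIM starts) can be tried locally without access to Unison. The sketch below uses SQLite and three hypothetical tables (ig, tm, itim) as stand-ins for the Unison views. The first two rows reuse coordinates from the result table above; pseq_ids 9001 and 9002 are invented decoys that should be rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ig   (pseq_id INTEGER, start INTEGER, stop INTEGER, eval REAL);
CREATE TABLE tm   (pseq_id INTEGER, start INTEGER, stop INTEGER);
CREATE TABLE itim (pseq_id INTEGER, start INTEGER, stop INTEGER);
""")
conn.executemany("INSERT INTO ig VALUES (?, ?, ?, ?)", [
    (234, 262, 316, 7.4e-06),
    (254, 158, 213, 1.9e-07),
    (9001, 300, 350, 1e-06),   # decoy: Ig lies after the TM helix
    (9002, 10, 60, 0.5),       # decoy: Ig hit is not significant
])
conn.executemany("INSERT INTO tm VALUES (?, ?, ?)", [
    (234, 440, 462), (254, 284, 306), (9001, 100, 122), (9002, 100, 122),
])
conn.executemany("INSERT INTO itim VALUES (?, ?, ?)", [
    (234, 518, 523), (254, 386, 391), (9001, 400, 405), (9002, 200, 205),
])

QUERY = """
SELECT ig.pseq_id, ig.start, ig.stop, tm.start, tm.stop, itim.start, itim.stop
  FROM ig
  JOIN tm   ON ig.pseq_id = tm.pseq_id   AND ig.stop < tm.start
  JOIN itim ON tm.pseq_id = itim.pseq_id AND tm.stop < itim.start
 WHERE ig.eval < 1e-2
 ORDER BY ig.pseq_id
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Putting the positional constraints in the join conditions keeps the domain architecture visible in the query itself, which is what replaces the hand-written filter program of the old way.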
10. “Are you sure about this Stan? It seems odd that a
pointy head and a long beak is what makes them fly.”
J. Workman, Science 245:1399 (1989)
11. Kiran Mukhyala
Fernando Bazan, Matt Brauer, Jason Hackney, Pete Haverty,
Ken Jung, Josh Kaminker, Nandini Krishnamurthy, Li Li, Yun Li,
Shiuh-ming Loh, Jinfeng Liu, Peng Yue, Jianjun Zhang, Yan Zhang
http://unison-db.org/
Open access web site, downloads, documentation, references
unison-db.org:5432
PostgreSQL & ODBC/JDBC/SDBC access
12. Unison Contents
[Diagram: the data types Unison links together, shown with example TNF records]
➢ sequences: >Unison:98 MSTESMIRDVE...FGIIAL; >Unison:23782 VRSSSRTPSD...FGIIAL
➢ aliases: TNFA_HUMAN, Q1XHZ6, IPI00001671.1, INCY:1109711.FL1p, CCDS4702.1, gi:25952111
➢ protein features (start | stop | e-value | type)
● 1 | 23 | | SS
● 108 | 143 | 1.8e-06 | EGF
● 162 | 184 | | TM
● 133 | 138 | | ITIM
➢ Entrez: gene_id, symbol, locus
➢ HUGO: TNFSF9, TNFSF10, TNFSF11
➢ GO: Function; transcription, initiation, elongation
➢ SNPs: P84L, A94T
➢ homologs
● NP_000585.2 NP_036807.1 | RAT
● NP_000585.2 NP_038721.1 | MOUSE
● NP_000585.2 XP_858423.1 | CANFA
➢ patents: Geneseq:AAP60074, 1991-10-29, SUNTORY, EP205038-A; New tumour...
➢ taxonomy: 9606 Homo sapiens; 10090 Mus musculus; 10028 Rattus rattus
➢ alignments (with aa-to-resid mapping): TNFA 1tnfA; TNFA 1tnfB; TNFA 5tswF
➢ loci: 1 233 6+:31651498-31653288
➢ genomes: Hs35, Hs36, RAT
➢ probes: HGU133P, WHG
➢ structures: 1tnf, 1a8m, 2tun, 4tsv, 5tsw
➢ SCOP: all alpha; all beta (Ig, TNF-like); alpha+beta
13. Ex1: Mine for sequences w/conserved features.
14. Ex2: Locate SNPs and Domains on Structure
15. Unison can also help you...
➢ Answer more sophisticated questions.
● Require orthologs or a specified exon structure.
➢ Annotate hits.
● Annotate with locus, probes, HUGO gene name,
structures, PubMed refs, external links.
● Group splice forms by locus.
➢ Explore alternatives.
● How do parameters influence results?
● Try other prediction algorithms.
➢ Stay current.
● When new data are available, just rerun the query.
➢ Move on.
● The same data are available to other projects and
other people.
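Because predictions are stored rather than filtered at compute time, asking "how do parameters influence results?" amounts to rerunning a query with a different bound. The following is a minimal sketch on a hypothetical one-table SQLite database; the table name ig_hits is invented, and the five rows reuse pseq_ids and e-values from the ITIM result table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ig_hits (pseq_id INTEGER, eval REAL)")
conn.executemany("INSERT INTO ig_hits VALUES (?, ?)", [
    (234, 7.4e-06), (254, 1.9e-07), (544, 6.6e-04),
    (797, 7.6e-09), (1113, 1.2e-05),
])

def count_hits(conn, max_eval):
    """Count Ig hits whose e-value is below the given cutoff."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM ig_hits WHERE eval < ?", (max_eval,)
    ).fetchone()
    return n

# Sweep the cutoff; each step is just the same query with a new parameter.
sweep = {cutoff: count_hits(conn, cutoff) for cutoff in (1e-2, 1e-5, 1e-8)}
for cutoff, n in sweep.items():
    print(f"eval < {cutoff:g}: {n} hits")
```

The same pattern covers staying current: after new rows are loaded, the unchanged query returns the updated answer.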