Outline
• Onion
• Main InterPro pipeline: predict protein families, domains, sites
• A lot
• CluSTr
• Automatic classification of proteins based on sequence similarity
• A little
InterPro data pipelines: What’s in it for me? 9 April 2008
Mission: to explore strange new proteins…
Onion + Protein Sequence = Prediction of Functional Annotation
InterPro
Protein families, domains, repeats and sites
Requirements
• Handle all member databases and algorithms
• HMMER (e.g. Gene3D, PANTHER)
• Regular expressions (PROSITE)
• SignalP
• TMHMM
• BLAST (PIRSF)
• FingerPRINTScan (PRINTS)
• Fast
• Wide and deep coverage
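As an illustration of the "Regular expressions (PROSITE)" requirement, here is a minimal sketch of translating a PROSITE pattern into an ordinary regular expression. The function name `prosite_to_regex` is hypothetical and is not part of Onion. The sketch covers only the common syntax (`x`, `[...]`, `{...}`, repeat counts, `<` and `>` anchors) and ignores rarer cases such as anchors inside brackets.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into a Python regular expression.

    Simplified sketch: handles x, [ABC], {ABC}, (n), (n,m), '<' and '>'.
    """
    pattern = pattern.rstrip(".")  # PROSITE patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split the element into its core and an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


if __name__ == "__main__":
    # N-glycosylation site pattern (PROSITE PS00001)
    regex = prosite_to_regex("N-{P}-[ST]-{P}")
    print(regex, bool(re.search(regex, "AANGSAA")))
```

Once translated, the pattern can be scanned against a sequence with `re.search` or `re.finditer`.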
Design
• UniParc
• Solves mapping problem
• Sequential IDs
• Comprehensive – many DBs, all sequences
• Method Archive
• Minimise calculations
• Read flat files once
• Decoupled analysis and post-processing
The Trinity
• Onion
• UniParc: protein sequences
• Method Archive: member database methods
New sequences
• UniParc supplies the new sequences
• Onion runs the new sequences against all methods in the Method Archive
Member database release
• Methods are added, changed or deleted in the Method Archive
• Onion runs the new and changed methods against all sequences in UniParc
Advantages:
• If only the post-processing or cut-off has changed, only that part is rerun
• If nothing has changed, there is no need to rerun
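The two update cases (new sequences, and a member database release) can be sketched as a small planning function that lists only the (sequence, method) pairs that still need computing. This is an illustrative sketch, not Onion's code: the function name and the identifiers in the usage example are made up.

```python
def plan_runs(all_sequences, new_sequences, all_methods, changed_methods):
    """Return the set of (sequence, method) pairs that must be computed.

    - New sequences are run against all methods.
    - New or changed methods are run against all sequences.
    Everything else is already stored and is not recalculated.
    """
    work = set()
    for sequence in new_sequences:
        for method in all_methods:
            work.add((sequence, method))
    for method in changed_methods:
        for sequence in all_sequences:
            work.add((sequence, method))
    return work


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    sequences = {"SEQ1", "SEQ2", "SEQ3"}
    methods = {"METHOD_A", "METHOD_B"}
    todo = plan_runs(sequences, {"SEQ3"}, methods, {"METHOD_B"})
    print(len(todo), "of", len(sequences) * len(methods), "pairs to compute")
```

With one new sequence and one changed method, four of the six possible pairs are computed; with no changes the plan is empty, which matches the "no change, no rerun" advantage.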
“Deluge” mode (manual)
New release of a model database: search new and changed models against all of UniParc
Flow shown in the diagram (labelled “anthill”):
• UniParc → FASTA file
• HMM, Profile and FPrint flatfiles → 1000s of model files
• FASTA file + model files → bsub submission cmds → LSF
• LSF → output files (raw results)
• Parse, reformat → SQL*Loader file
• Load → ONION raw results table
• Post-processing → final results table
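The "bsub submission cmds" step in the Deluge flow could look roughly like the sketch below, which builds one LSF job per model file. The slides do not show the actual commands, so several things are assumptions: the use of `hmmpfam` (the HMMER 2 search program) as the search binary, the queue name `normal`, the job names and all file paths. Only the `bsub` options `-q`, `-J` and `-o` are standard LSF flags. `shlex.join` needs Python 3.8 or later.

```python
import shlex


def bsub_commands(model_files, fasta_file, out_dir, queue="normal"):
    """Build one bsub command line per model file.

    Each job searches one model file against the FASTA file and writes
    its raw output to its own file, ready for the parse/reformat step.
    """
    commands = []
    for index, model_file in enumerate(model_files):
        argv = [
            "bsub",
            "-q", queue,                          # LSF queue (assumed name)
            "-J", f"onion_{index}",               # job name (hypothetical)
            "-o", f"{out_dir}/job_{index}.out",   # raw results file
            "hmmpfam", model_file, fasta_file,    # assumed search command
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for line in bsub_commands(["models/chunk_0.hmm", "models/chunk_1.hmm"],
                              "uniparc.fasta", "out"):
        print(line)
```

Splitting the model database into many small files is what lets LSF spread the search across the cluster; the output files are then parsed independently of the search itself.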
“Drip” mode (automatic)
New sequences: search all models every 4 minutes
Flow shown in the diagram (labelled “anthill”):
• UniParc → extract new sequences
• New sequences + HMM, Profile and FPrint flatfiles → bsub submission cmds → LSF
• LSF → output files (raw results)
• Parse, reformat and load → ONION raw results table
• Post-processing → final results table
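One plausible way to implement the "extract new sequences" step is a high-water mark on UniParc's sequential IDs: remember the highest ID already processed and select everything above it on each cycle. The slides do not say that Onion works this way, so treat this as a sketch under that assumption. The table and column names are invented, and `sqlite3` stands in for Oracle so the example is self-contained.

```python
import sqlite3


def extract_new_sequences(conn, last_seen_id):
    """Return rows with an ID above the high-water mark, plus the new mark."""
    rows = conn.execute(
        "SELECT id, sequence FROM protein WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_mark = rows[-1][0] if rows else last_seen_id
    return rows, new_mark


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (id INTEGER PRIMARY KEY, sequence TEXT)")
    conn.executemany("INSERT INTO protein VALUES (?, ?)",
                     [(1, "MKV"), (2, "MAL"), (3, "MSE")])
    new_rows, mark = extract_new_sequences(conn, last_seen_id=1)
    print(new_rows, mark)
```

A scheduler would call this every cycle (every 4 minutes in Drip mode), submit the returned sequences for analysis, and store the new mark for the next cycle.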
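The refinery
Diagram labels:
• Search program: HMMER
• Member databases: Pfam, TIGRFAM, SMART, SUPERFAMILY, GENE3D, PIRSF, PANTHER
• Cut-offs: GA cut-off, TC cut-off, E-value cut-off (shown twice)
• Filters and post-processing steps: AM filter, clan, nested, threshold (kinase), domainFinder, pirsf, pantherScore, assignment
• Data: sequence, Oracle (raw data), Oracle (refined data)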
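Because raw results are stored before any filtering, a cut-off can be changed and reapplied without repeating the search. The sketch below illustrates that idea with two generic rules: a per-method score threshold (the style of a GA or TC cut-off) and a single E-value cut-off. The class, function names, identifiers and threshold values are all hypothetical; the real refinery steps (clan, nested, domainFinder and so on) are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawHit:
    sequence_id: str
    method_id: str
    score: float
    evalue: float


def refine_by_score(raw_hits, min_score_by_method):
    """Keep hits whose score reaches the threshold stored for their method."""
    return [h for h in raw_hits
            if h.score >= min_score_by_method[h.method_id]]


def refine_by_evalue(raw_hits, max_evalue):
    """Keep hits whose E-value is at or below a single cut-off."""
    return [h for h in raw_hits if h.evalue <= max_evalue]


if __name__ == "__main__":
    raw = [
        RawHit("SEQ1", "METHOD_A", score=35.0, evalue=1e-8),
        RawHit("SEQ2", "METHOD_A", score=18.0, evalue=0.02),
        RawHit("SEQ2", "METHOD_B", score=22.0, evalue=1e-3),
    ]
    # Changing a threshold only means rerunning these cheap filters.
    print(len(refine_by_score(raw, {"METHOD_A": 25.0, "METHOD_B": 20.0})))
    print(len(refine_by_evalue(raw, max_evalue=0.01)))
```

This is the practical payoff of decoupling analysis from post-processing: the expensive search output is reused, and only the filter is rerun.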
Onion vs InterProScan
• Similarities
• Software: HMMER, TMHMM, SignalP
• Models: Pfam, Gene3D, PRINTS …etc
• Differences
• Internal use only
• Decoupled analysis and post-processing
• Java + database
• Faster
Limitations
• Database design
• Inflexible – single member DB version
• Redundant
• Tight coupling
• Internal
• Difficult to test/debug
• External
• Oracle
• LSF
• File system
Plans
• Merge InterProScan
• Single code base = reduced maintenance cost
• Java (Java 5? Spring? Maven?)
• Database (Oracle, Derby?, Hibernate, Java stored procs?)
• Testable
• JUnit
• Continuous integration?
• API
• Java (web services?)
• Oracle: views, stored procs
What’s in it for me?
• UniProt curators
• On-demand sequence analysis?
• Ensembl production
• InterPro hits
• Pre- or post-UniParc?
CluSTr
• Input: UniProtKB, IPI, Ensembl Human – 6 million sequences
• Output:
• Similarity scores (Smith-Waterman) – 3.5 billion
• Clusters (single linkage, aka nearest neighbour)
• Orthologues (best reciprocal hit) – 627 species
• Every 3 weeks (UniProt cycle)
• Availability: Oracle, web app, FTP (sims + GO mappings)
• Customers
• integr8 (orthologues)
• Druggable Genome (similarities)
• Potential
• Set-based analyses
• Similarities on-demand
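The two CluSTr outputs named above, single-linkage clusters and best reciprocal hits, can be sketched from a list of pairwise similarity scores. This is a toy illustration, not CluSTr's implementation: the score threshold, protein names and species labels are invented, and union-find is just one common way to compute single-linkage clusters.

```python
def single_linkage(similarities, threshold):
    """Cluster proteins: any pair scoring >= threshold joins two clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, score in similarities:
        root_a, root_b = find(a), find(b)
        if score >= threshold:
            parent[root_a] = root_b
    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in clusters.values())


def best_reciprocal_hits(similarities, species_of):
    """Return cross-species pairs that are each other's best-scoring hit."""
    best = {}  # (protein, other species) -> (score, partner)
    for a, b, score in similarities:
        for p, q in ((a, b), (b, a)):
            if species_of[p] == species_of[q]:
                continue
            key = (p, species_of[q])
            if key not in best or score > best[key][0]:
                best[key] = (score, q)
    pairs = set()
    for (p, _species), (_score, q) in best.items():
        back = best.get((q, species_of[p]))
        if back is not None and back[1] == p:
            pairs.add(tuple(sorted((p, q))))
    return sorted(pairs)


if __name__ == "__main__":
    sims = [("h1", "m1", 90), ("h1", "m2", 40), ("h2", "m2", 80), ("h2", "m1", 85)]
    species = {"h1": "human", "h2": "human", "m1": "mouse", "m2": "mouse"}
    print(single_linkage(sims, threshold=88))
    print(best_reciprocal_hits(sims, species))
```

Lowering the threshold to 80 in this toy data chains all four proteins into one cluster, which is the characteristic behaviour of single linkage (nearest neighbour).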
Acknowledgements
• InterPro
• Robert Petryszak (Dark Side)
• Craig McAnulla (Onion)
• John Maslen (CluSTr)
• Beat Ramseier (Method Archive)
• Sarah Hunter (Management)
• integr8
• Paul Kersey (CluSTr)
• A Team
• Tracy Mumford
• Kerry Smith
Thank you