EBI is an Outstation of the European Molecular Biology Laboratory. 
9 April 2008
InterPro data pipelines
Antony Quinn
Outline
• Onion
  • Main InterPro pipeline: predict protein families, domains, sites
  • A lot
• CluSTr
  • Automatic classification of proteins based on sequence similarity
  • A little
InterPro data pipelines: What’s in it for me? 9 April 2008
Not this...
Mission: to explore strange new proteins…
Onion + Protein Sequence = Prediction of Functional Annotation
InterPro
Protein families, domains, repeats and sites
Requirements
• Handle all member databases and algorithms
  • HMMER (e.g. Gene3D, PANTHER)
  • Regular expressions (PROSITE)
  • SignalP
  • TMHMM
  • BLAST (PIRSF)
  • FingerPRINTScan (PRINTS)
• Fast
• Wide and deep coverage
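The "Regular expressions (PROSITE)" requirement can be illustrated with a minimal sketch that translates a PROSITE-style pattern into a Python regular expression. The function name `prosite_to_regex` is hypothetical, and the translation covers only the common pattern elements (`x`, `[...]`, `{...}`, repeat counts and the `<`/`>` anchors); it is not how Onion itself runs PROSITE.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; handles only the common elements.
    """
    pattern = pattern.rstrip(".")  # PROSITE patterns end with a full stop
    parts = []
    for element in pattern.split("-"):  # elements are separated by '-'
        # {P} means "any residue except P"
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "repeat 2 to 4 times"
        element = element.replace("(", "{").replace(")", "}")
        # x means "any residue"
        element = element.replace("x", ".")
        # < and > anchor the match to the sequence termini
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("N-{P}-[ST]-{P}.")
    print(regex, bool(re.search(regex, "MKNGSA")))
```

Translating to a standard regex engine keeps this class of method cheap to run compared with the HMM-based searches.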
Design
• UniParc
• Solves mapping problem
• Sequential IDs
• Comprehensive – many DBs, all sequences
• Method Archive
• Minimise calculations
• Read flat files once
• Decoupled analysis and post-processing
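A minimal sketch of why a non-redundant archive with sequential IDs helps: each distinct sequence is stored once, every source database entry maps to that one ID, and "what is new since last time?" becomes a simple ID comparison. The class `SequenceArchive`, the MD5 digest and the accessions below are illustrative assumptions, not UniParc's actual implementation.

```python
import hashlib


class SequenceArchive:
    """Toy non-redundant sequence archive (hypothetical, for illustration)."""

    def __init__(self):
        self._id_by_digest = {}  # sequence digest -> sequential ID
        self.xrefs = {}          # sequential ID -> {(source_db, accession)}

    def add(self, source_db, accession, sequence):
        """Register a database entry and return the ID of its sequence."""
        # Assumption: MD5 of the upper-cased sequence identifies it uniquely.
        digest = hashlib.md5(sequence.upper().encode("ascii")).hexdigest()
        if digest not in self._id_by_digest:
            # New sequence: hand out the next sequential ID.
            self._id_by_digest[digest] = len(self._id_by_digest) + 1
        seq_id = self._id_by_digest[digest]
        self.xrefs.setdefault(seq_id, set()).add((source_db, accession))
        return seq_id

    def ids_after(self, last_seen_id):
        """IDs assigned since last_seen_id, i.e. sequences not yet analysed."""
        return sorted(i for i in self._id_by_digest.values() if i > last_seen_id)


if __name__ == "__main__":
    archive = SequenceArchive()
    # Placeholder accessions; the same sequence arrives from two databases.
    print(archive.add("UniProtKB", "ACC_1", "MKTAYIAK"))
    print(archive.add("Ensembl", "ACC_2", "MKTAYIAK"))
    print(archive.ids_after(0))
```

Because results are keyed on the sequence ID rather than on any one database's accession, each sequence only needs to be analysed once however many databases contain it.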
The Trinity
• Onion
• UniParc: protein sequences
• Method Archive: member database methods
New sequences
• UniParc: new sequences
• Onion: run against all methods
• Method Archive
Member database release
• Method Archive: methods added, changed or deleted
• Onion: run new and changed methods against all sequences (UniParc)
• Advantages:
  • If only post-processing or cut-off changed – only run that part
  • No change – no need to rerun
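The release logic above can be sketched as a comparison between the previous and the new method sets. This is a minimal sketch under stated assumptions: the `Method` record, its `model_digest` and `cutoff` fields, and `plan_release` are hypothetical names, not the Method Archive's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    """Hypothetical archive record for one member database method."""
    model_digest: str  # checksum of the model file
    cutoff: float      # post-processing threshold


def plan_release(old, new):
    """Decide how much work each method needs after a member database release.

    old, new: dict mapping method ID -> Method.
    """
    plan = {"search": [], "postprocess_only": [], "skip": [], "delete": []}
    for method_id, current in new.items():
        previous = old.get(method_id)
        if previous is None or previous.model_digest != current.model_digest:
            # Added or changed model: search it against all sequences.
            plan["search"].append(method_id)
        elif previous.cutoff != current.cutoff:
            # Same model, new cut-off: reuse raw results, redo post-processing.
            plan["postprocess_only"].append(method_id)
        else:
            # No change: no need to rerun.
            plan["skip"].append(method_id)
    plan["delete"] = [m for m in old if m not in new]
    return plan


if __name__ == "__main__":
    old = {"A": Method("d1", 25.0), "B": Method("d2", 20.0)}
    new = {"A": Method("d1", 25.0), "B": Method("d2", 22.0), "C": Method("d3", 10.0)}
    print(plan_release(old, new))
```

Only the "search" group costs cluster time; the "postprocess_only" group is where keeping raw results separate from refined results pays off.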
“Deluge” mode (manual)
New release of model database – search new and changed models against all of UniParc
Diagram (box labelled “anthill”), built up step by step:
• UniParc: FASTA file
• HMM, Profile and FPrint flat files: 1000s of model files
• bsub submission cmds
• LSF
• Output files (raw results)
• Parse, reformat: SQL*Loader file
• Load: ONION raw results table
• Post-processing: final results table
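The "bsub submission cmds" step can be sketched as generating one LSF job per model file. This is a minimal sketch, not Onion's code: `bsub_commands` and the file paths are hypothetical, `hmmsearch` stands in for whichever search program a given member database needs, and only the `-J` (job name) and `-o` (output file) options of `bsub` are used; queue and resource options are site-specific and left out.

```python
import shlex


def bsub_commands(fasta_path, model_paths, output_dir, search_program="hmmsearch"):
    """Build one bsub command line per model file (hypothetical helper)."""
    commands = []
    for model_path in model_paths:
        name = model_path.rsplit("/", 1)[-1]       # job name = model file name
        output_path = f"{output_dir}/{name}.out"   # raw results for this model
        job = [search_program, model_path, fasta_path]
        commands.append(
            "bsub -J " + shlex.quote(name)
            + " -o " + shlex.quote(output_path)
            + " " + " ".join(shlex.quote(arg) for arg in job)
        )
    return commands


if __name__ == "__main__":
    for line in bsub_commands("uniparc.fasta", ["models/model_1.hmm"], "out"):
        print(line)
```

Splitting the work into one job per model file lets LSF spread thousands of independent searches across the cluster, and a failed job can be resubmitted on its own.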
“Drip” mode (automatic)
New sequences – search all models every 4 minutes
Diagram (box labelled “anthill”), built up step by step:
• UniParc: extract new sequences
• HMM, Profile and FPrint flat files
• bsub submission cmds
• LSF
• Output files (raw results)
• Parse, reformat and load: ONION raw results table
• Post-processing: final results table
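One way to picture the automatic mode is a polling loop that remembers the highest sequence ID it has already handled. This is a sketch under assumptions: `drip_cycle`, `poll_forever` and the two callables are hypothetical, and the only detail taken from the slide is the 4-minute interval.

```python
import time

POLL_SECONDS = 4 * 60  # "every 4 minutes"


def drip_cycle(fetch_new_sequences, submit, last_seen_id):
    """Run one polling cycle and return the new high-water mark.

    fetch_new_sequences(last_seen_id) -> list of (sequence_id, sequence)
    submit(sequences) -> hands the batch to the search step
    """
    new_sequences = fetch_new_sequences(last_seen_id)
    if not new_sequences:
        return last_seen_id  # nothing new; keep the old mark
    submit(new_sequences)
    return max(seq_id for seq_id, _ in new_sequences)


def poll_forever(fetch_new_sequences, submit, last_seen_id=0):
    """Repeat drip_cycle indefinitely, sleeping between cycles."""
    while True:
        last_seen_id = drip_cycle(fetch_new_sequences, submit, last_seen_id)
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    store = [(1, "AAA"), (2, "CCC")]
    fetch = lambda last: [(i, s) for i, s in store if i > last]
    print(drip_cycle(fetch, print, 0))
```

Sequential IDs make the "extract new sequences" step a single comparison against the stored mark.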
The refinery
Diagram: HMMER results in Oracle (raw data) are post-processed into Oracle (refined data).
• Member databases: Pfam, TIGRFAM, SMART, SUPERFAMILY, GENE3D, PIRSF, PANTHER
• Cut-offs: GA cut-off, TC cut-off, E-value cut-off (twice)
• Other labels: AM filter, clan, nested, threshold (kinase), domainFinder, sequence, assignment, pirsf, pantherScore
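The idea of refining raw hits with per-database rules can be sketched as a filter over stored results. Everything named here is an assumption for illustration: the `RawHit` fields, the `refine` function and the threshold values are hypothetical, and the real post-processing steps (clan and nested filtering, domainFinder and so on) are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawHit:
    """Hypothetical raw result row: one method matched one sequence."""
    sequence_id: int
    method_id: str
    member_db: str
    score: float
    evalue: float


def refine(raw_hits, score_cutoffs, evalue_cutoffs):
    """Keep only the hits that pass their cut-off.

    score_cutoffs: method ID -> minimum score (per-model cut-off)
    evalue_cutoffs: member database -> maximum E-value
    Hits with no applicable cut-off are kept unchanged.
    """
    refined = []
    for hit in raw_hits:
        if hit.method_id in score_cutoffs:
            keep = hit.score >= score_cutoffs[hit.method_id]
        else:
            keep = hit.evalue <= evalue_cutoffs.get(hit.member_db, float("inf"))
        if keep:
            refined.append(hit)
    return refined


if __name__ == "__main__":
    hits = [RawHit(1, "M1", "DB_A", 30.0, 1e-10), RawHit(2, "M1", "DB_A", 10.0, 1e-2)]
    print(refine(hits, {"M1": 25.0}, {}))
```

Because the raw hits stay in their own table, changing a cut-off means calling the filter again with new thresholds rather than repeating the searches.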
Onion vs InterProScan
• Similarities
  • Software: HMMER, TMHMM, SignalP
  • Models: Pfam, Gene3D, PRINTS, etc.
• Differences
  • Internal use only
  • Decoupled analysis and post-processing
  • Java + database
  • Faster
Limitations
• Database design
  • Inflexible – single member DB version
  • Redundant
• Tight coupling
  • Internal
    • Difficult to test/debug
  • External
    • Oracle
    • LSF
    • File system
Plans
• Merge with InterProScan
  • Single code base = reduced maintenance cost
  • Java (Java 5? Spring? Maven?)
  • Database (Oracle, Derby?, Hibernate, Java stored procs?)
• Testable
  • JUnit
  • Continuous integration?
• API
  • Java (web services?)
  • Oracle: views, stored procs
What’s in it for me?
• UniProt curators
  • On-demand sequence analysis?
• Ensembl production
  • InterPro hits
  • Pre- or post-UniParc?
CluSTr
• Input: UniProtKB, IPI, Ensembl Human – 6 million sequences
• Output:
  • Similarity scores (Smith-Waterman) – 3.5 billion
  • Clusters (single linkage, aka nearest neighbour)
  • Orthologues (best reciprocal hit) – 627 species
• Every 3 weeks (UniProt cycle)
• Availability: Oracle, web app, FTP (sims + GO mappings)
• Customers
  • integr8 (orthologues)
  • Druggable Genome (similarities)
• Potential
  • Set-based analyses
  • Similarities on-demand
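The two derived outputs named above, single-linkage clusters and best-reciprocal-hit orthologues, can both be computed from a list of pairwise similarity scores. The sketch below uses a union-find structure for the clustering; the function names, the threshold and the protein labels are hypothetical, and it assumes one symmetric score per protein pair rather than CluSTr's actual scoring.

```python
def single_linkage(similarities, threshold):
    """Cluster proteins: any pair scoring >= threshold joins two clusters.

    similarities: iterable of (protein_a, protein_b, score).
    Returns a sorted list of clusters, each a sorted list of proteins.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in similarities:
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for member in parent:
        clusters.setdefault(find(member), []).append(member)
    return sorted(sorted(c) for c in clusters.values())


def best_reciprocal_hits(cross_species_scores):
    """Pairs (a, b) where b is a's best hit and a is b's best hit.

    cross_species_scores: iterable of (protein_in_species_a,
    protein_in_species_b, score).
    """
    best_for_a, best_for_b = {}, {}
    for a, b, score in cross_species_scores:
        if a not in best_for_a or score > best_for_a[a][1]:
            best_for_a[a] = (b, score)
        if b not in best_for_b or score > best_for_b[b][1]:
            best_for_b[b] = (a, score)
    return sorted(
        (a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a
    )


if __name__ == "__main__":
    sims = [("P1", "P2", 90), ("P2", "P3", 80), ("P3", "P4", 10)]
    print(single_linkage(sims, threshold=50))
    print(best_reciprocal_hits([("h1", "m1", 100), ("h2", "m1", 80)]))
```

Single linkage needs only one sufficiently strong link to merge two groups, which is why it is also called nearest neighbour clustering; the reciprocal test is stricter and yields at most one partner per protein.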
Acknowledgements
• InterPro
  • Robert Petryszak (Dark Side)
  • Craig McAnulla (Onion)
  • John Maslen (CluSTr)
  • Beat Ramseier (Method Archive)
  • Sarah Hunter (Management)
• integr8
  • Paul Kersey (CluSTr)
• A Team
  • Tracy Mumford
  • Kerry Smith
Thank you
