Combining large-scale evolutionary analyses
with multiple biological data sources to predict
human protein function


                              David Jones
   UCL Depts. of Computer Science and Structural and Molecular Biology
Background
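In UniProt, 30% of human proteins still have no functional annotations at all,
and only 0.5% have completely specific ones for all aspects (MF, BP, CC).

[Figure: two pie charts of annotation coverage split by GO aspect (MF, BP, CC); the first highlights the 30% with no annotation]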
Main approaches for function annotation
• Annotation transfers by homology
  e.g. BLAST, HMMER
  Only applicable to a subset of the data
  Has reached a plateau in terms of novel function
    annotation but provides highest quality information


• Model-classifier based using sequence features
  Limited to common and broad functions for which there
    are many examples
FFPRED - Function Prediction Pipeline
Novel sequence: amino acid sequence
Characteristics: structure, disorder, aa, transmem, motifs, localisation
Classification: GO term SVM
Output: posterior probability estimate
Going further – computing gene function
from multiple data sources
• FFPRED is a currently available server for human
  (and vertebrate) proteins

• It works well but is limited to predicting only the
  functional classes that it was trained to recognize
• Extending the library requires time-consuming
  training of new SVM models
• It also cannot be applied to rare functional classes
  due to limited training sets
Desirable features of a new approach


• Able to annotate all sequences

• Able to predict rare functions

• Able to offer something more than simple
  homology-based approaches

• Amenable to easy and quick updating
FunctionSpace Data Sources for H. sapiens

•   Sequence similarity
•   Signal peptides and other local features
•   Predicted secondary structure
•   Transmembrane segments
•   Predicted disordered regions
•   Domain architecture patterns
•   Gene fusion information
•   Gene co-expression
•   Protein-protein interactions



        For each sequence 49,231 features were derived
Aim
To estimate the functional similarity (a.k.a. semantic distance)
between two human proteins from their sequence features
plus available high throughput data.



[Diagram: Protein A and Protein B are compared to give a Functional Similarity Score]
Large-scale (domain-based) evolutionary
features

• Patterns of domain occurrence can provide
  valuable functional clues

• “Deeper” homology detection allows greater
  coverage

• We make use of our in-house fold/domain
  recognition method and several public domain
  libraries
pDomTHREADER Domain Coverage

[Figure: domain coverage by residues and by sequences, Gene3D vs. threading. Labels in source order: Residues, 35.7% Gene3D, 81.6% threading; Sequences, 64.8% threading, 59.4% Gene3D]
[Figure: bar chart of CATH domain annotations (0 to 7,000,000), public domain vs. threading]

37.56% increase in domain annotations across 5.5M sequences
~1.7 million novel domain assignments over public domain data
Computational Practicalities
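[Diagram: on the Legion nodes, each of the 5.5M query sequences is searched with PSI-BLAST (1 min to 3 hours per query) against a 2 Gb sequence database (5.5M seqs) to find matches and generate alignments; the results are then stored and post-processed]
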

“Embarrassingly parallel” application: one sequence = one job.
Ideal capacity filling task for a modern supercomputer like Legion.
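
A minimal sketch of the "one sequence = one job" pattern, assuming BLAST+ `psiblast` as the search program. The file names, database name, iteration count and the helper names `build_job`, `run_job` and `run_all` are hypothetical. The sketch only builds the command lines so that it runs without BLAST installed.

```python
from concurrent.futures import ThreadPoolExecutor

def build_job(seq_id, db_path="seqdb"):
    """Return the command line for one independent search job (hypothetical paths)."""
    return ["psiblast", "-query", f"{seq_id}.fasta", "-db", db_path,
            "-num_iterations", "3", "-out", f"{seq_id}.out"]

def run_job(seq_id):
    # Placeholder: a real job would call
    # subprocess.run(build_job(seq_id), check=True) and then parse the output.
    return seq_id, build_job(seq_id)

def run_all(seq_ids, workers=4):
    """Run every job independently; no job needs results from any other job."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_job, seq_ids))

if __name__ == "__main__":
    for seq_id, cmd in run_all(["seq0001", "seq0002"]).items():
        print(seq_id, " ".join(cmd))
```

Because the jobs share no state, the same function could be submitted to a cluster scheduler instead of a local thread pool.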
Gene Fusion Events can Predict Protein-
Protein Interactions from Sequence Data
[Diagram: proteins H1 (CATH 3.90.850.10, fumarylacetoacetase) and H2 (CATH 3.60.15.10, beta-lactamase) occur fused as a bi-functional enzyme (3.90.850.10 + 3.60.15.10) in Mycobacterium tuberculosis, Mycobacterium paratuberculosis and Mycobacterium avium. Functional terms shown: hydrolase activity; hydrolysis of C-N bonds; hydrolysis of C-C bonds]
A Novel Gene Fusion Discovered using CATH
domain fusion analysis

[Diagram: Phosphoglyceromutase (CATH 3.40.120.10, Alpha-D-Glucose-1,6-Bisphosphate) and DNA repair protein RAD50 (CATH 3.40.50.300, P-loop nucleotide triphosphate hydrolases) occur fused (3.40.120.10 + 3.40.50.300) in Saccharopolyspora erythraea and Syntrophomonas wolfei; a transcription coupling repair factor (3.40.120.10) is also shown. Functional terms shown: oxidative stress; D-glucose metabolism; DNA repair]
Novel Gene Fusion Discovery

[Diagram: domain order of the fused proteins]
  Saccharopolyspora erythraea: 3.40.120.10, 3.40.50.300, 3.40.50.300
  Syntrophomonas wolfei: 3.40.50.300, 3.40.120.10

Novel annotations



  • Rice PGM1 gene is annotated as GO:0006950 (response to stress)
  • PGM3 has a relationship with a DNA repair sequence

   Kanazawa K, Ashida H (1991) Relationship between oxidative stress
   and hepatic phosphoglucomutase activity in rats. Int J Tissue React 13: 225
Domain based features

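[Diagram: two groups of domain-based features, one from scoring architectures and one from scoring complexes; feature counts shown: 7960 features and 11210 features]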
Fusion scoring




  Each domain is a feature; the score has 2 components:

  1. Prediction quality (logistic transform of feature)
  2. Promiscuity weight, related to the number of times the sequence
     occurs as part of a fused product: w_i = log(fus_i)
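
A minimal sketch of the two score components. The slide does not say how they are combined, so the sketch returns them separately. The logistic parameters `slope` and `midpoint` and all function names are assumptions for illustration.

```python
import math

def prediction_quality(raw_score, slope=1.0, midpoint=0.0):
    """Logistic transform of a raw feature value into the range (0, 1).
    slope and midpoint are hypothetical; the slide gives no values."""
    return 1.0 / (1.0 + math.exp(-slope * (raw_score - midpoint)))

def promiscuity_weight(fusion_count):
    """w_i = log(fus_i), where fus_i is how often the sequence occurs in a fused product."""
    if fusion_count < 1:
        raise ValueError("fusion_count must be at least 1")
    return math.log(fusion_count)

def fusion_components(raw_score, fusion_count):
    # Returned as a pair because the combination rule is not stated in the source.
    return prediction_quality(raw_score), promiscuity_weight(fusion_count)

if __name__ == "__main__":
    print(fusion_components(raw_score=2.0, fusion_count=10))
```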
Integration of “External” Features:
Microarray Expression Data
[Figure: normalised microarray datasets. Probe signal (log2, 0 to 14) for Gene A and Gene B across 16 experiments (conditions); the two profiles are compared by Pearson correlation (R)]
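
A minimal sketch of the co-expression measure: the Pearson correlation between the log2 probe signals of two genes over the same set of conditions. The function name `pearson_r` and the example values are illustrative, not taken from the talk.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

if __name__ == "__main__":
    gene_a = [6.1, 7.4, 9.0, 8.2, 5.5]   # made-up log2 signals
    gene_b = [5.9, 7.0, 9.3, 8.0, 5.1]
    print(round(pearson_r(gene_a, gene_b), 3))
```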
Biclustering Microarray Expression Data




[Figure: two example biclusters: zinc binding sequences (global correlation 0.42) and a set of transcription factors (global correlation 0.48)]

23912 features generated from biclustering of 2346 publicly available microarrays
(81 experiments) using the BIMAX algorithm
FunctionSpace: Two-stage Integration of Data


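[Diagram: feature vectors for Protein A and Protein B feed 11 first-stage SVMs (SVMsw, SVMloc, SVMss, SVMtm, SVMdis, SVMdpc, SVMgfc, SVMdpp, SVMgfp, SVMge, SVMppi); their outputs feed a second-stage SVM (SVMfsc), which produces the Functional Similarity Score]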
A 3-D Projection of Annotated Human Proteins

• 49,231 dimensions first
  reduced to 11 dimensions by
  SVM regression with 11
  different groups of features
• Each protein is here
  represented as a point in this
  derived 11-D feature space
  projected into 3-D
• Colouring is according to
  functional similarity which
  shows that proteins with similar
  functions (warmer colours)
  cluster strongly in this space
• 75% of nearest neighbour pairs
  share common GO terms
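
A minimal sketch of how a "nearest neighbour pairs share common GO terms" statistic could be computed. It assumes Euclidean distance in the derived feature space and one set of GO terms per protein; both the distance choice and the function names are assumptions.

```python
import math

def nearest_neighbour(index, points):
    """Index of the closest other point by Euclidean distance (assumed metric)."""
    best, best_dist = None, math.inf
    for j, point in enumerate(points):
        if j == index:
            continue
        dist = math.dist(points[index], point)
        if dist < best_dist:
            best, best_dist = j, dist
    return best

def shared_go_fraction(points, go_terms):
    """Fraction of proteins whose nearest neighbour shares at least one GO term."""
    hits = 0
    for i in range(len(points)):
        j = nearest_neighbour(i, points)
        if go_terms[i] & go_terms[j]:
            hits += 1
    return hits / len(points)

if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
    terms = [{"a"}, {"a", "b"}, {"c"}, {"d"}]
    print(shared_go_fraction(pts, terms))
```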
Individual Feature Contributions
[Figure: chart of individual feature contributions; axis label: Matthews Correlation Coefficient]
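
For reference, a minimal sketch of the Matthews Correlation Coefficient computed from confusion-matrix counts. Returning 0 when the denominator is zero is a common convention and is an assumption here.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from true/false positive/negative counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when any marginal total is zero
    return (tp * tn - fp * fn) / denom

if __name__ == "__main__":
    print(mcc(tp=40, tn=45, fp=5, fn=10))
```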
Function Annotation Results for 20674
Unannotated IPI Human Sequences




 Each sequence is classed “Easy”, “Medium” or “Hard” depending on its
 degree of homology to functionally annotated proteins in UniProt.
Preliminary Results
In 2009 FunctionSpace produced GO term predictions for 19678 IPI
uncharacterized human sequences. 2746 have been annotated since.
 Measure                     MF       BP
 % Exact matches             16%      9%
 Mean semantic distance      -1.3     -1.7

[Figure: one plot each for MF and BP, with the x-axis running from less specific to more specific]
Initial considerations for CAFA


•   50,000 sequences
•   11 eukaryotic & 7 prokaryotic species
•   High specificity annotations needed
•   Partial descriptive text already in Swiss-Prot/UniProt for some
    entries

• FFPRED/FunctionSpace would not be enough

• Need to incorporate textual information from databases
  and comprehensive homology (orthology)-derived labels

• Need to get all this working in a few months!
Best Laid Plans for CAFA


• Plan A
   – Build separate annotation pipelines for missing data
   – Calibrate each pipeline according to precision values derived from
     benchmark on 500 highly annotated Swiss-Prot entries
   – Combine pipeline annotations using high-level classifier (SVM or Naive
     Bayes)


• Plan B
   – No time to build high-level classifier!
   – Combine annotation sources using heuristic graphical approach

• Hope for the best!
  (and expect the worst...)
GO term prediction from Swiss-Prot
 text-mining

• For targets which already had
  descriptive text, keywords or
  comments in Swiss-Prot, GO terms
  were assigned using a naive Bayes
  text-classification approach
• Single words and groups of 2 and 3
  words were counted
• Words occurring in different Swiss-Prot
  record types were distinguished in the
  analysis, and some simple pre-parsing
  of feature (FT) records was carried out
  in addition.
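
A minimal sketch of naive Bayes text classification over single words and groups of 2 and 3 words. It uses add-one smoothing and treats each record as one string. The handling of separate Swiss-Prot record types and FT pre-parsing is omitted. The class name, the training texts and the GO identifiers in the example are illustrative only.

```python
import math
from collections import Counter, defaultdict

def ngrams(text, max_n=3):
    """Single words plus groups of 2 and 3 consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

class NaiveBayesGO:
    def __init__(self):
        self.term_docs = Counter()                 # records seen per GO term
        self.feature_counts = defaultdict(Counter) # n-gram counts per GO term
        self.vocab = set()
        self.n_docs = 0

    def train(self, text, go_terms):
        feats = ngrams(text)
        self.n_docs += 1
        self.vocab.update(feats)
        for term in go_terms:
            self.term_docs[term] += 1
            self.feature_counts[term].update(feats)

    def log_score(self, text, term):
        counts = self.feature_counts[term]
        total = sum(counts.values())
        score = math.log(self.term_docs[term] / self.n_docs)  # log prior
        for feat in ngrams(text):
            if feat in self.vocab:  # ignore n-grams never seen in training
                score += math.log((counts[feat] + 1) / (total + len(self.vocab)))
        return score

    def rank(self, text):
        """Known GO terms, best-scoring first."""
        return sorted(self.term_docs,
                      key=lambda t: self.log_score(text, t), reverse=True)

if __name__ == "__main__":
    model = NaiveBayesGO()
    model.train("zinc finger transcription factor", ["GO:0003700"])
    model.train("serine protease hydrolase", ["GO:0008236"])
    print(model.rank("putative zinc finger protein"))
```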
Homology-based annotation sources

• PSI-BLAST searches against UniProt
    – Low E-value threshold to ensure close homologues used for
      annotation transfer
    – Alignment length threshold to avoid domain problem
• Transfer of annotations from orthologues
    – EggNOG 2.0
    – More reliable GO term transfer than for PSI-BLAST but lower
      coverage
• Profile-profile searches against Swiss-Prot
    – Low reliability transfer from very distant homologues
    – Improves coverage where needed (at expense of specificity)
Heuristic back-propagation of precision
estimates


[Diagram: back-propagation of precision estimates through the GO graph]

Back-propagation is repeated for each annotation source to define a
consensus for each node:

   P' = 1 - (1 - P)(1 - Q)
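
A minimal sketch of the combination step. The slide does not define P and Q. The sketch assumes they are precision estimates for the same GO node, with P the running consensus and Q the value from the next annotation source. Taking the maximum when pushing a score up to parent terms is also an assumption, as are all names.

```python
def propagate(source_scores, parents):
    """Push each term's precision up to all of its ancestors,
    keeping the highest value seen at each node (assumed rule)."""
    out = dict(source_scores)
    stack = list(source_scores.items())
    while stack:
        term, p = stack.pop()
        for parent in parents.get(term, ()):
            if p > out.get(parent, 0.0):
                out[parent] = p
                stack.append((parent, p))
    return out

def combine(p, q):
    """P' = 1 - (1 - P)(1 - Q)"""
    return 1.0 - (1.0 - p) * (1.0 - q)

def consensus(sources, parents):
    """Repeat the propagation for every source and merge the results per node."""
    total = {}
    for scores in sources:
        for term, q in propagate(scores, parents).items():
            total[term] = combine(total.get(term, 0.0), q)
    return total

if __name__ == "__main__":
    parents = {"child": ["mid"], "mid": ["root"]}   # toy GO fragment
    sources = [{"child": 0.5}, {"mid": 0.5}]        # two annotation sources
    print(consensus(sources, parents))
```

With this rule, agreement between sources raises the confidence of a node, and a general term never scores lower than its most confident descendant within a source.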
Final steps
• After back-propagation, all referenced GO terms
  are ranked according to final confidence scores

• To reduce conflicting annotations, pairs of terms
  with zero observed co-occurrence frequency in
  GOA are subjected to pairwise tournament
  selection.

• Results submitted to server using the
  mouse-window-cut-paste-click-submit
  algorithm
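
A minimal sketch of the conflict-reduction step, assuming a deterministic tournament in which the lower-scoring term of each never-co-occurring pair is dropped. The slide does not give the exact selection rule. The co-occurrence table is assumed to list every pair observed together in GOA, so a missing pair counts as zero.

```python
from itertools import combinations

def tournament_filter(scores, cooccurrence):
    """Return terms ranked by confidence, minus the losers of pairwise tournaments."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    eliminated = set()
    for winner, loser in combinations(ranked, 2):  # winner is ranked above loser
        if winner in eliminated or loser in eliminated:
            continue
        if cooccurrence.get(frozenset((winner, loser)), 0) == 0:
            eliminated.add(loser)
    return [term for term in ranked if term not in eliminated]

if __name__ == "__main__":
    scores = {"A": 0.9, "B": 0.8, "C": 0.7}
    observed = {frozenset(("A", "C")): 12, frozenset(("B", "C")): 3}
    print(tournament_filter(scores, observed))
```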
CASP vs CAFA from a Predictor’s Point of
View
• Number of targets
   – Manual vs automated approaches
• Difficulty of targets
   – A major limit in driving CASP forwards
• Assessment
   – Hard to pre-judge impact of decisions made during
     prediction season
• Tools for the community
   – Standards and methods in CASP have been very useful
• Getting the word out to the wider community
Acknowledgements

 Anna Lobley
 Domenico Cozzetto
 Daniel Buchan



 Kevin Bryson
 Christine Orengo

More Related Content

What's hot

Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
Integrated DNA Technologies
 
Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
Aureliano Bombarely
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
mikaelhuss
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
Nikolay Vyahhi
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
Thomas Keane
 
Suman (2)
Suman (2)Suman (2)
Suman (2)
Suman Tripatthi
 
subtractive hybridization
subtractive hybridizationsubtractive hybridization
subtractive hybridization
Sakshi Saxena
 
Thesis
ThesisThesis
Thesis
Stefan Loska
 
Genome editing & targeting tools
Genome editing & targeting toolsGenome editing & targeting tools
Genome editing & targeting tools
S Rasouli
 
Simultaneious monitoring of phosphorylation events and protein protein intera...
Simultaneious monitoring of phosphorylation events and protein protein intera...Simultaneious monitoring of phosphorylation events and protein protein intera...
Simultaneious monitoring of phosphorylation events and protein protein intera...
PerkinElmer, Inc.
 
Cpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexesCpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexes
Integrated DNA Technologies
 
3D-Screen Technology overview
3D-Screen Technology overview3D-Screen Technology overview
3D-Screen Technology overview
pguedat
 
J.1747 0285.2009.00940.x
J.1747 0285.2009.00940.xJ.1747 0285.2009.00940.x
J.1747 0285.2009.00940.x
daisydew
 
UIowa 2005 - Iowa City, IA
UIowa 2005 - Iowa City, IAUIowa 2005 - Iowa City, IA
UIowa 2005 - Iowa City, IA
Randy Simpson
 
Protein Engineering
Protein EngineeringProtein Engineering
Protein Engineering
Anupkumar Sharma
 
RNA-seq Analysis
RNA-seq AnalysisRNA-seq Analysis
RNA-seq Analysis
COST action BM1006
 
Protein engineering and its techniques himanshu
Protein engineering and its techniques himanshuProtein engineering and its techniques himanshu
Protein engineering and its techniques himanshu
himanshu kamboj
 
Sequence assembly
Sequence assemblySequence assembly
Sequence assembly
Ramya P
 
Getting started with CRISPR: a review of gene knockout and homology-directed ...
Getting started with CRISPR: a review of gene knockout and homology-directed ...Getting started with CRISPR: a review of gene knockout and homology-directed ...
Getting started with CRISPR: a review of gene knockout and homology-directed ...
Integrated DNA Technologies
 
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Integrated DNA Technologies
 

What's hot (20)

Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
 
Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
 
Suman (2)
Suman (2)Suman (2)
Suman (2)
 
subtractive hybridization
subtractive hybridizationsubtractive hybridization
subtractive hybridization
 
Thesis
ThesisThesis
Thesis
 
Genome editing & targeting tools
Genome editing & targeting toolsGenome editing & targeting tools
Genome editing & targeting tools
 
Simultaneious monitoring of phosphorylation events and protein protein intera...
Simultaneious monitoring of phosphorylation events and protein protein intera...Simultaneious monitoring of phosphorylation events and protein protein intera...
Simultaneious monitoring of phosphorylation events and protein protein intera...
 
Cpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexesCpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexes
 
3D-Screen Technology overview
3D-Screen Technology overview3D-Screen Technology overview
3D-Screen Technology overview
 
J.1747 0285.2009.00940.x
J.1747 0285.2009.00940.xJ.1747 0285.2009.00940.x
J.1747 0285.2009.00940.x
 
UIowa 2005 - Iowa City, IA
UIowa 2005 - Iowa City, IAUIowa 2005 - Iowa City, IA
UIowa 2005 - Iowa City, IA
 
Protein Engineering
Protein EngineeringProtein Engineering
Protein Engineering
 
RNA-seq Analysis
RNA-seq AnalysisRNA-seq Analysis
RNA-seq Analysis
 
Protein engineering and its techniques himanshu
Protein engineering and its techniques himanshuProtein engineering and its techniques himanshu
Protein engineering and its techniques himanshu
 
Sequence assembly
Sequence assemblySequence assembly
Sequence assembly
 
Getting started with CRISPR: a review of gene knockout and homology-directed ...
Getting started with CRISPR: a review of gene knockout and homology-directed ...Getting started with CRISPR: a review of gene knockout and homology-directed ...
Getting started with CRISPR: a review of gene knockout and homology-directed ...
 
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
 

Viewers also liked

Biotech Clusters - Nutopya
Biotech Clusters - NutopyaBiotech Clusters - Nutopya
Biotech Clusters - Nutopya
Nutopya Life Science
 
Polt Presentation Priority Setting Vienna 18 02 2010
Polt Presentation Priority Setting Vienna 18 02 2010Polt Presentation Priority Setting Vienna 18 02 2010
Polt Presentation Priority Setting Vienna 18 02 2010
Wolfgang_Polt
 
The Chinese Way of Innovation
The Chinese Way of InnovationThe Chinese Way of Innovation
The Chinese Way of Innovation
Tjitra & Associates
 
Zongshen Cyclone Fly
Zongshen Cyclone FlyZongshen Cyclone Fly
Zongshen Cyclone Fly
hi.interest
 
Ignobel2010
Ignobel2010Ignobel2010
Ignobel2010
Iddo
 
Ismb grant-writing-2012
Ismb grant-writing-2012Ismb grant-writing-2012
Ismb grant-writing-2012
Iddo
 
Go camp 2010_cacao
Go camp 2010_cacaoGo camp 2010_cacao
Go camp 2010_cacao
Iddo
 
David Jones AFP/CAFA2011
David Jones AFP/CAFA2011David Jones AFP/CAFA2011
David Jones AFP/CAFA2011
Iddo
 
Manual Vaic Mp9 T800
Manual Vaic Mp9 T800Manual Vaic Mp9 T800
Manual Vaic Mp9 T800
Psyfers
 
Jeff Grethe: CAMERA
Jeff Grethe: CAMERAJeff Grethe: CAMERA
Jeff Grethe: CAMERA
Iddo
 
Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013
Iddo
 
Portfolio 01
Portfolio 01Portfolio 01
Portfolio 01
Graphic Design
 
Katrina Photos Short Version
Katrina Photos Short VersionKatrina Photos Short Version
Katrina Photos Short Versionhomestarmy26
 
Metagenomics Biocuration 2013
Metagenomics Biocuration 2013Metagenomics Biocuration 2013
Metagenomics Biocuration 2013
Iddo
 
Afp cafa djuric
Afp cafa djuricAfp cafa djuric
Afp cafa djuric
Iddo
 
A Year In the Western English Channel
A Year In the Western English ChannelA Year In the Western English Channel
A Year In the Western English Channel
Iddo
 
Innovation in China: Zonghsen case study. Carles Debart
Innovation in China: Zonghsen case study. Carles DebartInnovation in China: Zonghsen case study. Carles Debart
Innovation in China: Zonghsen case study. Carles Debart
Carles Debart
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentation
gawump
 
Genome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin DiscoveryGenome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin Discovery
Iddo
 
Critical Assessment of Function Annotation, 2005
Critical Assessment of Function Annotation, 2005Critical Assessment of Function Annotation, 2005
Critical Assessment of Function Annotation, 2005
Iddo
 

Viewers also liked (20)

Biotech Clusters - Nutopya
Biotech Clusters - NutopyaBiotech Clusters - Nutopya
Biotech Clusters - Nutopya
 
Polt Presentation Priority Setting Vienna 18 02 2010
Polt Presentation Priority Setting Vienna 18 02 2010Polt Presentation Priority Setting Vienna 18 02 2010
Polt Presentation Priority Setting Vienna 18 02 2010
 
The Chinese Way of Innovation
The Chinese Way of InnovationThe Chinese Way of Innovation
The Chinese Way of Innovation
 
Zongshen Cyclone Fly
Zongshen Cyclone FlyZongshen Cyclone Fly
Zongshen Cyclone Fly
 
Ignobel2010
Ignobel2010Ignobel2010
Ignobel2010
 
Ismb grant-writing-2012
Ismb grant-writing-2012Ismb grant-writing-2012
Ismb grant-writing-2012
 
Go camp 2010_cacao
Go camp 2010_cacaoGo camp 2010_cacao
Go camp 2010_cacao
 
David Jones AFP/CAFA2011
David Jones AFP/CAFA2011David Jones AFP/CAFA2011
David Jones AFP/CAFA2011
 
Manual Vaic Mp9 T800
Manual Vaic Mp9 T800Manual Vaic Mp9 T800
Manual Vaic Mp9 T800
 
Jeff Grethe: CAMERA
Jeff Grethe: CAMERAJeff Grethe: CAMERA
Jeff Grethe: CAMERA
 
Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013
 
Portfolio 01
Portfolio 01Portfolio 01
Portfolio 01
 
Katrina Photos Short Version
Katrina Photos Short VersionKatrina Photos Short Version
Katrina Photos Short Version
 
Metagenomics Biocuration 2013
Metagenomics Biocuration 2013Metagenomics Biocuration 2013
Metagenomics Biocuration 2013
 
Afp cafa djuric
Afp cafa djuricAfp cafa djuric
Afp cafa djuric
 
A Year In the Western English Channel
A Year In the Western English ChannelA Year In the Western English Channel
A Year In the Western English Channel
 
Innovation in China: Zonghsen case study. Carles Debart
Innovation in China: Zonghsen case study. Carles DebartInnovation in China: Zonghsen case study. Carles Debart
Innovation in China: Zonghsen case study. Carles Debart
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentation
 
Genome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin DiscoveryGenome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin Discovery
 
Critical Assessment of Function Annotation, 2005
Critical Assessment of Function Annotation, 2005Critical Assessment of Function Annotation, 2005
Critical Assessment of Function Annotation, 2005
 

Similar to Vienna afp2011

Microarray biotechnologg ppy dna microarrays
Microarray biotechnologg ppy dna microarraysMicroarray biotechnologg ppy dna microarrays
Microarray biotechnologg ppy dna microarrays
ayeshasattarsandhu
 
Protein function and bioinformatics
Protein function and bioinformaticsProtein function and bioinformatics
Protein function and bioinformatics
Neil Saunders
 
Machine Learning
Machine LearningMachine Learning
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
Copenhagenomics
 
From sample-to-spray: high performance workflow for top down protein analysis
From sample-to-spray: high performance workflow for top down protein analysisFrom sample-to-spray: high performance workflow for top down protein analysis
From sample-to-spray: high performance workflow for top down protein analysis
Expedeon
 
Humans Aliens And E Harmony Or Why There Is No Such Thing As A Free Lunch In ...
Humans Aliens And E Harmony Or Why There Is No Such Thing As A Free Lunch In ...Humans Aliens And E Harmony Or Why There Is No Such Thing As A Free Lunch In ...
Humans Aliens And E Harmony Or Why There Is No Such Thing As A Free Lunch In ...
Mark Berjanskii
 
SOT short course on computational toxicology
SOT short course on computational toxicology SOT short course on computational toxicology
SOT short course on computational toxicology
Sean Ekins
 
Using ontologies to do integrative systems biology
Using ontologies to do integrative systems biologyUsing ontologies to do integrative systems biology
Using ontologies to do integrative systems biology
Chris Evelo
 
Functional annotation of invertebrate genomes
Functional annotation of invertebrate genomesFunctional annotation of invertebrate genomes
Functional annotation of invertebrate genomes
Surya Saha
 
Discovering drugs (I. Belda)
Discovering drugs (I. Belda)Discovering drugs (I. Belda)
Thesis def
Thesis defThesis def
Thesis def
Jay Vyas
 
HPLC2005
HPLC2005HPLC2005
HPLC2005
Julio Garcia
 
GeneArt® services - Gene synthesis through protein production
GeneArt® services - Gene synthesis through protein productionGeneArt® services - Gene synthesis through protein production
GeneArt® services - Gene synthesis through protein production
Thermo Fisher Scientific
 
HIV Vaccines Process Development & Manufacturing - Pitfalls & Possibilities
HIV Vaccines Process Development & Manufacturing - Pitfalls & PossibilitiesHIV Vaccines Process Development & Manufacturing - Pitfalls & Possibilities
HIV Vaccines Process Development & Manufacturing - Pitfalls & Possibilities
KBI Biopharma
 
Seminar 20150920.2
Seminar 20150920.2Seminar 20150920.2
Seminar 20150920.2
Christopher Day
 
20081216 05袁國芳 紅麴菌基因體計畫及基因研究
20081216 05袁國芳 紅麴菌基因體計畫及基因研究20081216 05袁國芳 紅麴菌基因體計畫及基因研究
20081216 05袁國芳 紅麴菌基因體計畫及基因研究
Monascus2008
 
Giab agbt small_var_2020
Giab agbt small_var_2020Giab agbt small_var_2020
Giab agbt small_var_2020
GenomeInABottle
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_course
hansjansen9999
 
PomBase conventions for improving annotation depth, breadth, consistency and ...
PomBase conventions for improving annotation depth, breadth, consistency and ...PomBase conventions for improving annotation depth, breadth, consistency and ...
PomBase conventions for improving annotation depth, breadth, consistency and ...
Valerie Wood
 
GMOD 2014 MAKER Lecture
GMOD 2014 MAKER LectureGMOD 2014 MAKER Lecture
GMOD 2014 MAKER Lecture
barrymoore
 

Similar to Vienna afp2011 (20)

Microarray biotechnologg ppy dna microarrays
Microarray biotechnologg ppy dna microarraysMicroarray biotechnologg ppy dna microarrays
Microarray biotechnologg ppy dna microarrays
 
Protein function and bioinformatics
Protein function and bioinformaticsProtein function and bioinformatics
Protein function and bioinformatics
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
 
From sample-to-spray: high performance workflow for top down protein analysis
Vienna afp2011

  • 1. Combining large-scale evolutionary analyses with multiple biological data sources to predict human protein function David Jones UCL Depts. of Computer Science and Structural and Molecular Biology
  • 2. Background In Uniprot, 30% of human proteins still have no functional annotations at all … and only 0.5% have completely specific ones for all aspects. [Diagrams: overlap of the GO aspects MF, BP and CC; 30% unannotated]
  • 3. Main approaches for function annotation • Annotation transfers by homology e.g. BLAST, HMMER Only applicable to a subset of the data Has reached a plateau in terms of novel function annotation but provides highest quality information • Model-classifier based using sequence features Limited to common and broad functions for which there are many examples
  • 4. FFPRED - Function Prediction Pipeline Novel sequence (amino acid sequence) → Characteristics (structure, disorder, aa, transmem, motifs, localisation) → Classification (GO term SVM) → posterior probability estimate
  • 5.
  • 6.
  • 7. Going further – computing gene function from multiple data sources • FFPRED is a currently available server for human (and vertebrate) proteins • It works well but is limited to predicting only the functional classes that it was trained to recognize • Extending the library requires time consuming training of new SVM models • It also cannot be applied to rare functional classes due to limited training sets
  • 8. Desirable features of a new approach • Able to annotate all sequences • Able to predict rare functions • Able to offer something more than simple homology-based approaches • Amenable to easy and quick updating
  • 9. FunctionSpace Data Sources for H. sapiens • Sequence similarity • Signal peptides and other local features • Predicted secondary structure • Transmembrane segments • Predicted disordered regions • Domain architecture patterns • Gene fusion information • Gene co-expression • Protein-protein interactions For each sequence 49,231 features were derived
  • 10. Aim To estimate the functional similarity (a.k.a. semantic distance) between two human proteins from their sequence features plus available high throughput data. Protein A Functional Similarity Score Protein B
  • 11. Large-scale (domain-based) evolutionary features • Patterns of domain occurrence can provide valuable functional clues • “Deeper” homology detection allows greater coverage • We make use of our in-house fold/domain recognition method and several public domain libraries
  • 12. pDomTHREADER Domain Coverage [Charts: residue coverage and CATH domain annotations for Gene3D versus threading, with sequence counts on a 0 to 7,000,000 axis; percentages shown: 35.7%, 81.6%, 64.8%, 59.4%] 37.56% increase in domain annotations across 5.5M sequences. ~1.7 million novel domain assignments over public domain data.
  • 13. Computational Practicalities Legion nodes: 5.5M query sequences → PSIBLAST against a 2Gb sequence database (5.5M seqs) → find matches & generate alignments (1 min to 3 hours) → store & post-process. “Embarrassingly parallel” application: one sequence = one job. Ideal capacity-filling task for a modern supercomputer like Legion.
  • 14. Gene Fusion Events can Predict Protein- Protein Interactions from Sequence Data H1 3.90.850.10 3.60.15.10 H2 fumaryl aceto acetase beta lactamase Bi-functional enzyme 3.90.850.10 3.60.15.10 Mycobacterium tuberculosis Mycobacterium paratuberculosis Mycobacterium avium Hydrolase activity Hydrolysis of C-N bonds Hydrolysis of C-C bonds
  • 15. A Novel Gene Fusion Discovered using CATH domain fusion analysis Phosphoglyceromutase DNA repair (RAD50) 3.40.120.10 3.40.50.300 Alpha-D-Glucose-1,6-Bisphosphate P-loop nucleotide triphosphate hydrolases 3.40.120.10 Transcription coupling repair factor 3.40.120.10 3.40.50.300 Saccharopolyspora erythraea Syntrophomonas wolfei Oxidative stress D-glucose metabolism DNA repair
  • 16. Novel Gene Fusion Discovery 3.40.120.10 3.40.50.300 3.40.50.300 Saccharopolyspora erythraea 3.40.50.300 3.40.120.10 Syntrophomonas wolfei Novel annotations • Rice PGM1 gene annotated as GO:0006950 response to stress • PGM3 has relationship with DNA repair sequence Kanazawa K, Ashida H (1991) Relationship between oxidative stress and hepatic phosphoglucomutase activity in rats. Int J Tissue React 13: 225
  • 17. Domain based features Score architectures Score complexes 7960 features 11210 features
  • 18. Fusion scoring Each domain is a feature; the score has 2 components: 1. Prediction quality (logistic transform of feature) 2. Promiscuity weight related to the number of times the sequence occurs as part of a fused product: $w_i = \log \mathrm{fus}_i$
  • 19. Integration of “External” Features: Microarray Expression Data Normalised microarray datasets. [Plot: probe signal (log2) for Gene A and Gene B across 16 experiments (conditions)] Pearson Correlation (R)
  • 20. Biclustering Microarray Expression Data Zinc binding sequences A set of transcription factors global correlation 0.42 global correlation 0.48 23912 features generated from biclustering of 2346 publicly available microarrays (81 experiments) using BIMAX algorithm
  • 21. FunctionSpace: Two-stage Integration of Data Feature vectors for Protein A and Protein B → SVMsw, SVMloc, SVMss, SVMtm, SVMdis, SVMdpc, SVMfsc, SVMgfc, SVMdpp, SVMgfp, SVMge, SVMppi → Functional Similarity Score
  • 22. A 3-D Projection of Annotated Human Proteins • 49,231 dimensions first reduced to 11 dimensions by SVM regression with 11 different groups of features • Each protein is here represented as a point in this derived 11-D feature space projected into 3-D • Colouring is according to functional similarity which shows that proteins with similar functions (warmer colours) cluster strongly in this space • 75% of nearest neighbour pairs share common GO terms
  • 23. Individual Feature Contributions Matthews Correlation Coefficient
  • 24. Function Annotation Results for 20674 Unannotated IPI Human Sequences Each sequence is classed “Easy”, “Medium” or “Hard” depending on degree of homology to functionally annotated proteins in UNIPROT.
  • 25. Preliminary Results In 2009 FunctionSpace produced GO term predictions for 19678 IPI uncharacterized human sequences. 2746 have been annotated since. Measure (MF / BP): % exact matches 16% / 9%; mean semantic distance -1.3 / -1.7 (scale from less specific to more specific)
  • 26. Initial considerations for CAFA • 50,000 sequences • 11 eukaryotic & 7 prokaryotic species • High specificity annotations needed • Partial descriptive text already in Swiss-Prot/Uniprot for some entries • FFPRED/FunctionSpace would not be enough • Need to incorporate textual information from databases and comprehensive homology(orthology)-derived labels • Need to get all this working in a few months!
  • 27. Best Laid Plans for CAFA • Plan A – Build separate annotation pipelines for missing data – Calibrate each pipeline according to precision values derived from benchmark on 500 highly annotated Swiss-Prot entries – Combine pipeline annotations using high-level classifier (SVM or Naive Bayes) • Plan B – No time to build high-level classifier! – Combine annotation sources using heuristic graphical approach • Hope for the best! (and expect the worst...)
  • 28. GO term prediction from Swiss-Prot text-mining • For targets which already had descriptive text, keywords or comments in Swiss-Prot, GO terms were assigned using a naive Bayes text-classification approach • Single words and groups of 2 and 3 words were counted • Words occurring in different Swiss-Prot record types were distinguished in the analysis, and some simple pre-parsing of feature (FT) records was carried out in addition.
  • 29. Homology-based annotation sources • PSI-BLAST searches against Uniprot – Low E-value threshold to ensure close homologues used for annotation transfer – Alignment length threshold to avoid domain problem • Transfer of annotations from orthologues – EggNOG 2.0 – More reliable GO term transfer than for PSI-BLAST but lower coverage • Profile-profile searches against Swiss-Prot – Low reliability transfer from very distant homologues – Improves coverage where needed (at expense of specificity)
  • 30. Heuristic back-propagation of precision estimates Back-propagation of precision estimates is repeated for each annotation source to define a consensus for each node: $P' = 1 - (1 - P)(1 - Q)$
  • 31. Final steps • After back-propagation, all referenced GO terms are ranked according to final confidence scores • To reduce conflicting annotations, pairs of terms with zero observed co-occurrence frequency in GOA are subjected to pairwise tournament selection. • Results submitted to server using the mouse-window-cut-paste-click-submit algorithm
  • 32. CASP vs CAFA from a Predictor’s Point of View • Number of targets – Manual vs automated approaches • Difficulty of targets – A major limit in driving CASP forwards • Assessment – Hard to pre-judge impact of decisions made during prediction season • Tools for the community – Standards and methods in CASP have been very useful • Getting the word out to the wider community
  • 33. Acknowledgements Anna Lobley Domenico Cozzetto Daniel Buchan Kevin Bryson Christine Orengo
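Slide 19 uses the Pearson correlation (R) between the expression profiles of two genes as a co-expression feature. A minimal sketch of that calculation follows; the function name and the example signal values are hypothetical and are not taken from the talk.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variances, left unnormalised because the 1/n factors cancel.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical log2 probe signals for two genes across six conditions.
gene_a = [8.1, 9.4, 7.2, 10.3, 6.8, 9.9]
gene_b = [7.9, 9.0, 7.5, 10.1, 7.0, 9.6]
print(round(pearson_r(gene_a, gene_b), 3))
```

Values near 1 indicate that the two genes rise and fall together across conditions; the biclustering on slide 20 addresses the case where genes correlate only within a subset of conditions.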
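Slide 28 describes counting single words and groups of 2 and 3 words, kept separate by Swiss-Prot record type, as input to a naive Bayes text classifier. The sketch below covers only that feature-counting step. The function name, the whitespace tokenisation and the example text are assumptions for illustration, not the pipeline used for CAFA.

```python
from collections import Counter

def ngram_features(record_type, text, max_n=3):
    """Count word n-grams (n = 1..max_n), keyed by the record type they came from.

    Keying on (record_type, ngram) keeps the same words in, say, a
    description line and a keyword line as distinct features.
    """
    words = text.lower().split()  # assumption: simple whitespace tokenisation
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            feats[(record_type, " ".join(words[i:i + n]))] += 1
    return feats

# Hypothetical description (DE) text for one entry.
feats = ngram_features("DE", "zinc finger protein")
for key, count in sorted(feats.items()):
    print(key, count)
```

Counts like these, accumulated per GO term over annotated entries, would supply the per-class likelihoods that a naive Bayes classifier needs.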
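Slide 30 combines precision estimates with P' = 1 - (1 - P)(1 - Q) after propagating them through the GO graph. The sketch below shows one way such a scheme could work. The slide gives only the combination rule, so the propagation step (an ancestor takes the highest score among its descendants within one source), the function names and the toy term identifiers are all assumptions.

```python
def combine(p, q):
    """Combine two precision estimates as P' = 1 - (1 - P)(1 - Q)."""
    return 1.0 - (1.0 - p) * (1.0 - q)

def propagate_to_ancestors(scores, parents):
    """Push each predicted term's score up to all of its ancestors.

    scores  : dict mapping term -> precision estimate from one source
    parents : dict mapping term -> list of direct parent terms
    Assumption: within one source, an ancestor takes the highest score
    found among its descendants.
    """
    result = dict(scores)
    for term, p in scores.items():
        stack = list(parents.get(term, []))
        seen = set()
        while stack:
            anc = stack.pop()
            if anc in seen:
                continue
            seen.add(anc)
            result[anc] = max(result.get(anc, 0.0), p)
            stack.extend(parents.get(anc, []))
    return result

def consensus(sources, parents):
    """Combine the propagated scores from every source at each node."""
    total = {}
    for scores in sources:
        for term, p in propagate_to_ancestors(scores, parents).items():
            total[term] = combine(total.get(term, 0.0), p)
    return total

# Toy hierarchy with hypothetical identifiers: child -> mid -> root.
parents = {"GO:child": ["GO:mid"], "GO:mid": ["GO:root"]}
sources = [{"GO:child": 0.5}, {"GO:mid": 0.5}]
print(consensus(sources, parents))
```

In this toy case the specific term is supported by one source and keeps its score, while the two more general terms are supported by both sources and end up with a higher combined confidence.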