SlideShare a Scribd company logo
Protein Function Prediction
 by Integrating Different
 Data Sources
Liang Lan, Nemanja Djuric, Yuhong Guo, Slobodan Vucetic
            Temple University, Philadelphia, USA




            AFP/CAFA, Vienna, Austria, July 15-16th, 2011
Introduction

 Protein function annotation - key challenge in
 post-genomic era
 Experimental annotation accurate, but slow and
 expensive
 Large amount of information available
 Data mining techniques can help when dealing
 with available, large-scale data sets
Our approach

 Integrate information from different sources to
 predict gene functions
 We consider following data sources:
   Protein sequence similarity
   Protein-protein interaction
   Gene expression
 Hypothesis: Including information from various
 sources results in better predictor performance
Methodology

 We use weighted k-Nearest Neighbor algorithm
 to calculate likelihood that protein p has function f
 score( p, f ) =      ∑ sim( p, p' ) ⋅ I ( f ∈ functions ( p' ))
                   p '∈       ( p)
                          k


 Simple to implement, yet competitive when
 compared to SVM
 Different sim(p, p’) can be obtained with different
 data sources
Methodology - Cont’d
 We calculated different scores using different
 sources (sequence similarity, PPI, gene
 expression) for each (p, f) pair
 Total of J gene expressions, resulted in J+2
 scores that are combined:

   score( p, f ) = w SEQ ⋅ score SEQ ( p, f ) +
                   w PPI ⋅ score PPI ( p, f ) +
                      J
                   ∑ j =1 j
                         ( w EXP ⋅ score EXP ( p, f ))
                                         j
Integrating different scores
   How to find weights wSEQ, wPPI, wjEXP?
   We considered several methods:
      Assigning the same weights to all scores
      Weight optimization by likelihood maximization
      Weight optimization by large margin approaches
   Also considered enhancing similarity scheme
   using approach from Pandev et al.*

* “Incorporating functional inter-relationships into protein function
    prediction algorithms”, BMC Bioinformatics (2009)
Max-margin approach
 Define the following optimization problem:
   Given n genes and m scores, and f(x, y) is an m x 1
   vector of scores for gene x and function y, solve:
       1
  min || w || 2 +C ∑ ∑ ξ i ( y, y )
  w ,ξ 2
                   i y∈Yi , y∈Yi

  s.t. w Τ ( f ( xi , y ) − f ( xi , y )) ≥ 1 − ξ i ( y, y ), ∀i, y ∈ Yi , y ∈ Yi
  ξ i ( y, y ) ≥ 0, ∀i, y ∈ Yi , y ∈ Yi

   where w is an m x 1 weight vector learned during
   training, and C is a regularization parameter
Experimental setup

 We focused on function prediction for human
 proteins
 Data sources:
  Sequence similarity scores for all pairs of CAFA
  proteins
  Gene expressions - 392 Affymetrix GPL96 Platform
  microarray data sets from GEO
  PPI - Physical interactions between human proteins
  listed in OPHID database
Experimental setup - Cont’d

  8,714 annotated human proteins in CAFA
  training set
  Out of those 8,714, total of 2,869 proteins
  covered by all three data sources
  For evaluation, only GO functions annotated by
  more than 10 proteins are considered
    This resulted in 240 MF and 1,123 BP GO terms
  Neighborhood size fixed to 20
Score averaging scheme

 None of the considered approaches worked
 significantly and consistently better than simple
 averaging
 As a result, we give the same weight to all 3
 data sources:

                wSEQ = wPPI = 1/3
                wj EXP= 1/(3J)
Results (average AUC)
   ver. 1 - neighbors found among only 2,869 overlapping human
   proteins
   ver. 2 - neighbors found among all 8,714 human proteins
   ver. 3 - neighbors found among all 36,924 CAFA training proteins

Data source                          MF terms            BP terms
Microarray data                       0.6442              0.6279
PPI data                              0.6283              0.6671
Protein Sequence data, ver. 1         0.7636              0.6642
Protein Sequence data, ver. 2         0.7896              0.6921
Protein Sequence data, ver. 3         0.8396              0.7537
Integrating 3 data sources, ver. 1    0.8134              0.7468
Integrating 3 data sources, ver. 2    0.8494              0.7939
Integrating 3 data sources, ver. 3    0.8788              0.8165
Discussion

 Several important conclusions arise:
   Gene expression is more useful for MF, while PPI is
   more useful for BP prediction
   Sequence similarity data is superior to both gene
   expression and PPI data
   It is beneficial to transfer functions to human proteins
   from their orthologues
   Integration of data sources improves AUC significantly
   for both MF and BP terms
Results - Cont’d
  Comparison of AUC of sequence similarity
  scores (ver. 3) and integrated scores (ver. 3) for
  each GO term
Conclusion

 Some sources are more beneficial for BP, while
 some for MF terms prediction
 Integration of different sources improves function
 prediction significantly
 Exploring new integration techniques could lead
 to even better results
Thank you!
                 Questions?


Nemanja Djuric supported by travel grant
from US Department of Energy

More Related Content

Similar to Afp cafa djuric

ANTIC-2021_paper_95.pdf
ANTIC-2021_paper_95.pdfANTIC-2021_paper_95.pdf
ANTIC-2021_paper_95.pdf
DrGRevathy
 
Project Presentation
Project PresentationProject Presentation
Project Presentationbutest
 
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
Enrico Glaab
 
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
Sara Alvarez
 
Proteomics
ProteomicsProteomics
Proteomics
ruchibioinfo
 
Protein Distance Map Prediction based on a Nearest Neighbors Approach
Protein Distance Map Prediction based on a Nearest Neighbors ApproachProtein Distance Map Prediction based on a Nearest Neighbors Approach
Protein Distance Map Prediction based on a Nearest Neighbors Approach
Gualberto Asencio Cortés
 
Proteomics - Analysis and integration of large-scale data sets
Proteomics - Analysis and integration of large-scale data setsProteomics - Analysis and integration of large-scale data sets
Proteomics - Analysis and integration of large-scale data sets
Lars Juhl Jensen
 
Comparative study of artificial neural network based classification for liver...
Comparative study of artificial neural network based classification for liver...Comparative study of artificial neural network based classification for liver...
Comparative study of artificial neural network based classification for liver...
Alexander Decker
 
Drug Target Interaction (DTI) prediction (MSc. thesis)
Drug Target Interaction (DTI) prediction (MSc. thesis) Drug Target Interaction (DTI) prediction (MSc. thesis)
Drug Target Interaction (DTI) prediction (MSc. thesis)
Dimitris Papadopoulos
 
A novel optimized deep learning method for protein-protein prediction in bioi...
A novel optimized deep learning method for protein-protein prediction in bioi...A novel optimized deep learning method for protein-protein prediction in bioi...
A novel optimized deep learning method for protein-protein prediction in bioi...
IJECEIAES
 
IRJET- Disease Prediction using Machine Learning
IRJET-  Disease Prediction using Machine LearningIRJET-  Disease Prediction using Machine Learning
IRJET- Disease Prediction using Machine Learning
IRJET Journal
 
Stock market analysis using ga and neural network
Stock market analysis using ga and neural networkStock market analysis using ga and neural network
Stock market analysis using ga and neural network
Amr Abd El Latief
 
working_example_poster
working_example_posterworking_example_poster
working_example_posterHuikun Zhang
 
Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...
Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...
Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...Keiji Takamoto
 
B017410916
B017410916B017410916
B017410916
IOSR Journals
 
B017410916
B017410916B017410916
B017410916
IOSR Journals
 
Quantitative Proteomics: From Instrument To Browser
Quantitative Proteomics: From Instrument To BrowserQuantitative Proteomics: From Instrument To Browser
Quantitative Proteomics: From Instrument To BrowserNeil Swainston
 
Chronic Kidney Disease Prediction Using Machine Learning
Chronic Kidney Disease Prediction Using Machine LearningChronic Kidney Disease Prediction Using Machine Learning
Chronic Kidney Disease Prediction Using Machine Learning
IJCSIS Research Publications
 

Similar to Afp cafa djuric (20)

ANTIC-2021_paper_95.pdf
ANTIC-2021_paper_95.pdfANTIC-2021_paper_95.pdf
ANTIC-2021_paper_95.pdf
 
Project Presentation
Project PresentationProject Presentation
Project Presentation
 
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
 
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
A Critical Assessment Of Mus Musculus Gene Function Prediction Using Integrat...
 
Proteomics
ProteomicsProteomics
Proteomics
 
SecondaryStructurePredictionReport
SecondaryStructurePredictionReportSecondaryStructurePredictionReport
SecondaryStructurePredictionReport
 
Protein Distance Map Prediction based on a Nearest Neighbors Approach
Protein Distance Map Prediction based on a Nearest Neighbors ApproachProtein Distance Map Prediction based on a Nearest Neighbors Approach
Protein Distance Map Prediction based on a Nearest Neighbors Approach
 
Proteomics - Analysis and integration of large-scale data sets
Proteomics - Analysis and integration of large-scale data setsProteomics - Analysis and integration of large-scale data sets
Proteomics - Analysis and integration of large-scale data sets
 
Comparative study of artificial neural network based classification for liver...
Comparative study of artificial neural network based classification for liver...Comparative study of artificial neural network based classification for liver...
Comparative study of artificial neural network based classification for liver...
 
Drug Target Interaction (DTI) prediction (MSc. thesis)
Drug Target Interaction (DTI) prediction (MSc. thesis) Drug Target Interaction (DTI) prediction (MSc. thesis)
Drug Target Interaction (DTI) prediction (MSc. thesis)
 
A novel optimized deep learning method for protein-protein prediction in bioi...
A novel optimized deep learning method for protein-protein prediction in bioi...A novel optimized deep learning method for protein-protein prediction in bioi...
A novel optimized deep learning method for protein-protein prediction in bioi...
 
IRJET- Disease Prediction using Machine Learning
IRJET-  Disease Prediction using Machine LearningIRJET-  Disease Prediction using Machine Learning
IRJET- Disease Prediction using Machine Learning
 
Stock market analysis using ga and neural network
Stock market analysis using ga and neural networkStock market analysis using ga and neural network
Stock market analysis using ga and neural network
 
working_example_poster
working_example_posterworking_example_poster
working_example_poster
 
Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...
Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...
Theoretical evaluation of shotgun proteomic analysis strategies; Peptide obse...
 
Poster_JOBIM_v4.2
Poster_JOBIM_v4.2Poster_JOBIM_v4.2
Poster_JOBIM_v4.2
 
B017410916
B017410916B017410916
B017410916
 
B017410916
B017410916B017410916
B017410916
 
Quantitative Proteomics: From Instrument To Browser
Quantitative Proteomics: From Instrument To BrowserQuantitative Proteomics: From Instrument To Browser
Quantitative Proteomics: From Instrument To Browser
 
Chronic Kidney Disease Prediction Using Machine Learning
Chronic Kidney Disease Prediction Using Machine LearningChronic Kidney Disease Prediction Using Machine Learning
Chronic Kidney Disease Prediction Using Machine Learning
 

More from Iddo

What can Community Challenges do for You?
What can Community Challenges do for You?What can Community Challenges do for You?
What can Community Challenges do for You?
Iddo
 
Surviving Scientific Presentations
Surviving Scientific PresentationsSurviving Scientific Presentations
Surviving Scientific Presentations
Iddo
 
Friedberg lab-overview-grad-students-2019-nr
Friedberg lab-overview-grad-students-2019-nrFriedberg lab-overview-grad-students-2019-nr
Friedberg lab-overview-grad-students-2019-nr
Iddo
 
The roles communities play in improving bioinformatics: better software, bett...
The roles communities play in improving bioinformatics: better software, bett...The roles communities play in improving bioinformatics: better software, bett...
The roles communities play in improving bioinformatics: better software, bett...
Iddo
 
Why Your Microbiome Analysis is Wrong
Why Your Microbiome Analysis is WrongWhy Your Microbiome Analysis is Wrong
Why Your Microbiome Analysis is Wrong
Iddo
 
Tracing the Ancestry of Genomes in Bacteria
Tracing the Ancestry of Genomes in BacteriaTracing the Ancestry of Genomes in Bacteria
Tracing the Ancestry of Genomes in Bacteria
Iddo
 
Computational Challenges in Biological Data Science: an Optimistically Cautio...
Computational Challenges in Biological Data Science: an Optimistically Cautio...Computational Challenges in Biological Data Science: an Optimistically Cautio...
Computational Challenges in Biological Data Science: an Optimistically Cautio...
Iddo
 
Friedberg lab-overview-grad-students
Friedberg lab-overview-grad-studentsFriedberg lab-overview-grad-students
Friedberg lab-overview-grad-students
Iddo
 
Understanding Biological Function in Times of High Throughput and Low Output
Understanding Biological Function in Times of High Throughput and Low OutputUnderstanding Biological Function in Times of High Throughput and Low Output
Understanding Biological Function in Times of High Throughput and Low Output
Iddo
 
Random Musings on Fixing Data Shambles in Science
Random Musings on Fixing Data Shambles in ScienceRandom Musings on Fixing Data Shambles in Science
Random Musings on Fixing Data Shambles in Science
Iddo
 
Genome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin DiscoveryGenome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin Discovery
Iddo
 
Convergent divergent
Convergent divergentConvergent divergent
Convergent divergent
Iddo
 
Some US Science Funding sources
Some US Science Funding sourcesSome US Science Funding sources
Some US Science Funding sources
Iddo
 
CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013Iddo
 
Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013
Iddo
 
Metagenomics Biocuration 2013
Metagenomics Biocuration 2013Metagenomics Biocuration 2013
Metagenomics Biocuration 2013Iddo
 
Ismb grant-writing-2012
Ismb grant-writing-2012Ismb grant-writing-2012
Ismb grant-writing-2012Iddo
 
David Jones AFP/CAFA2011
David Jones AFP/CAFA2011David Jones AFP/CAFA2011
David Jones AFP/CAFA2011Iddo
 
Vienna afp2011
Vienna afp2011Vienna afp2011
Vienna afp2011Iddo
 
Go camp 2010_cacao
Go camp 2010_cacaoGo camp 2010_cacao
Go camp 2010_cacaoIddo
 

More from Iddo (20)

What can Community Challenges do for You?
What can Community Challenges do for You?What can Community Challenges do for You?
What can Community Challenges do for You?
 
Surviving Scientific Presentations
Surviving Scientific PresentationsSurviving Scientific Presentations
Surviving Scientific Presentations
 
Friedberg lab-overview-grad-students-2019-nr
Friedberg lab-overview-grad-students-2019-nrFriedberg lab-overview-grad-students-2019-nr
Friedberg lab-overview-grad-students-2019-nr
 
The roles communities play in improving bioinformatics: better software, bett...
The roles communities play in improving bioinformatics: better software, bett...The roles communities play in improving bioinformatics: better software, bett...
The roles communities play in improving bioinformatics: better software, bett...
 
Why Your Microbiome Analysis is Wrong
Why Your Microbiome Analysis is WrongWhy Your Microbiome Analysis is Wrong
Why Your Microbiome Analysis is Wrong
 
Tracing the Ancestry of Genomes in Bacteria
Tracing the Ancestry of Genomes in BacteriaTracing the Ancestry of Genomes in Bacteria
Tracing the Ancestry of Genomes in Bacteria
 
Computational Challenges in Biological Data Science: an Optimistically Cautio...
Computational Challenges in Biological Data Science: an Optimistically Cautio...Computational Challenges in Biological Data Science: an Optimistically Cautio...
Computational Challenges in Biological Data Science: an Optimistically Cautio...
 
Friedberg lab-overview-grad-students
Friedberg lab-overview-grad-studentsFriedberg lab-overview-grad-students
Friedberg lab-overview-grad-students
 
Understanding Biological Function in Times of High Throughput and Low Output
Understanding Biological Function in Times of High Throughput and Low OutputUnderstanding Biological Function in Times of High Throughput and Low Output
Understanding Biological Function in Times of High Throughput and Low Output
 
Random Musings on Fixing Data Shambles in Science
Random Musings on Fixing Data Shambles in ScienceRandom Musings on Fixing Data Shambles in Science
Random Musings on Fixing Data Shambles in Science
 
Genome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin DiscoveryGenome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin Discovery
 
Convergent divergent
Convergent divergentConvergent divergent
Convergent divergent
 
Some US Science Funding sources
Some US Science Funding sourcesSome US Science Funding sources
Some US Science Funding sources
 
CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013CAFA poster presented at CSHL Genome Informatics 2013
CAFA poster presented at CSHL Genome Informatics 2013
 
Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013
 
Metagenomics Biocuration 2013
Metagenomics Biocuration 2013Metagenomics Biocuration 2013
Metagenomics Biocuration 2013
 
Ismb grant-writing-2012
Ismb grant-writing-2012Ismb grant-writing-2012
Ismb grant-writing-2012
 
David Jones AFP/CAFA2011
David Jones AFP/CAFA2011David Jones AFP/CAFA2011
David Jones AFP/CAFA2011
 
Vienna afp2011
Vienna afp2011Vienna afp2011
Vienna afp2011
 
Go camp 2010_cacao
Go camp 2010_cacaoGo camp 2010_cacao
Go camp 2010_cacao
 

Recently uploaded

Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
UiPathCommunity
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 

Recently uploaded (20)

Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 

Afp cafa djuric

  • 1. Protein Function Prediction by Integrating Different Data Sources Liang Lan, Nemanja Djuric, Yuhong Guo, Slobodan Vucetic Temple University, Philadelphia, USA AFP/CAFA, Vienna, Austria, July 15-16th, 2011
  • 2. Introduction Protein function annotation - key challenge in post-genomic era Experimental annotation accurate, but slow and expensive Large amount of information available Data mining techniques can help when dealing with available, large-scale data sets
  • 3. Our approach Integrate information from different sources to predict gene functions We consider following data sources: Protein sequence similarity Protein-protein interaction Gene expression Hypothesis: Including information from various sources results in better predictor performance
  • 4. Methodology We use weighted k-Nearest Neighbor algorithm to calculate likelihood that protein p has function f score( p, f ) = ∑ sim( p, p' ) ⋅ I ( f ∈ functions ( p' )) p '∈ ( p) k Simple to implement, yet competitive when compared to SVM Different sim(p, p’) can be obtained with different data sources
  • 5. Methodology - Cont’d We calculated different scores using different sources (sequence similarity, PPI, gene expression) for each (p, f) pair Total of J gene expressions, resulted in J+2 scores that are combined: score( p, f ) = w SEQ ⋅ score SEQ ( p, f ) + w PPI ⋅ score PPI ( p, f ) + J ∑ j =1 j ( w EXP ⋅ score EXP ( p, f )) j
  • 6. Integrating different scores How to find weights wSEQ, wPPI, wjEXP? We considered several methods: Assigning the same weights to all scores Weight optimization by likelihood maximization Weight optimization by large margin approaches Also considered enhancing similarity scheme using approach from Pandev et al.* * “Incorporating functional inter-relationships into protein function prediction algorithms”, BMC Bioinformatics (2009)
  • 7. Max-margin approach Define the following optimization problem: Given n genes and m scores, and f(x, y) is an m x 1 vector of scores for gene x and function y, solve: 1 min || w || 2 +C ∑ ∑ ξ i ( y, y ) w ,ξ 2 i y∈Yi , y∈Yi s.t. w Τ ( f ( xi , y ) − f ( xi , y )) ≥ 1 − ξ i ( y, y ), ∀i, y ∈ Yi , y ∈ Yi ξ i ( y, y ) ≥ 0, ∀i, y ∈ Yi , y ∈ Yi where w is an m x 1 weight vector learned during training, and C is a regularization parameter
  • 8. Experimental setup We focused on function prediction for human proteins Data sources: Sequence similarity scores for all pairs of CAFA proteins Gene expressions - 392 Affymetrix GPL96 Platform microarray data sets from GEO PPI - Physical interactions between human proteins listed in OPHID database
  • 9. Experimental setup - Cont’d 8,714 annotated human proteins in CAFA training set Out of those 8,714, total of 2,869 proteins covered by all three data sources For evaluation, only GO functions annotated by more than 10 proteins are considered This resulted in 240 MF and 1,123 BP GO terms Neighborhood size fixed to 20
  • 10. Score averaging scheme None of the considered approaches worked significantly and consistently better than simple averaging As a result, we give the same weight to all 3 data sources: wSEQ = wPPI = 1/3 wj EXP= 1/(3J)
  • 11. Results (average AUC) ver. 1 - neighbors found among only 2,869 overlapping human proteins ver. 2 - neighbors found among all 8,714 human proteins ver. 3 - neighbors found among all 36,924 CAFA training proteins Data source MF terms BP terms Microarray data 0.6442 0.6279 PPI data 0.6283 0.6671 Protein Sequence data, ver. 1 0.7636 0.6642 Protein Sequence data, ver. 2 0.7896 0.6921 Protein Sequence data, ver. 3 0.8396 0.7537 Integrating 3 data sources, ver. 1 0.8134 0.7468 Integrating 3 data sources, ver. 2 0.8494 0.7939 Integrating 3 data sources, ver. 3 0.8788 0.8165
  • 12. Discussion Several important conclusions arise: Gene expression is more useful for MF, while PPI is more useful for BP prediction Sequence similarity data is superior to both gene expression and PPI data It is beneficial to transfer functions to human proteins from their orthologues Integration of data sources improves AUC significantly for both MF and BP terms
  • 13. Results - Cont’d Comparison of AUC of sequence similarity scores (ver. 3) and integrated scores (ver. 3) for each GO term
  • 14. Conclusion Some sources are more beneficial for BP, while some for MF terms prediction Integration of different sources improves function prediction significantly Exploring new integration techniques could lead to even better results
  • 15. Thank you! Questions? Nemanja Djuric supported by travel grant from US Department of Energy