Integration of Bioinformatics Web Services through the Search Computing Technology


  1. Doctoral Minor Research Project Defense, 19th November 2012
     Dipartimento di Elettronica e Informazione
     Integration of Bioinformatics Web Services through the Search Computing Technology
     Davide Chicco, davide.chicco@elet.polimi.it
  2. Summary
     1. The problem
        • Multidomain questions
     2. The proposed solution
        • GPDW Data Warehouse
        • Search Computing
        • Bio-SeCo
     3. Developed and added services
        • Exploiting GPDW Data Warehouse
        • Semantic Similarity
     4. Conclusions
  3. Data and search service scenario in the Life Sciences
     • In the Life Sciences: numerous data, sparsely distributed in many heterogeneous sources
     • Many are ranked (or partially ranked) data of various types, representing different phenomena, e.g.:
       - physical ordering, e.g. within a genome
       - analytical ordering through algorithmically assigned scores, e.g. representing levels of sequence similarity
       - experimentally measured values, such as gene expression levels
     • The ordering may represent a range of different notions, such as quantity, confidence, or location
  4. Life Science questions and their answering
     • Several Life Science questions:
       - are complex
       - require, to be answered, the integration and comprehensive evaluation of different data, often distributed, many of which are ranked
     • Answering complex questions requires integration of vertical search services to create multi-topic searches
       - where the different topic searches either refine or augment previous search results
     • Bioinformatics data integration platforms exist
       - Ordered data are poorly served, or not supported at all, by current data integration platforms
  5. Life Science multidomain question
     Example: “Which genes encode proteins in different organisms with high sequence similarity to a protein X and have some biomedical features in common, e.g. up/down significantly co-expressed in the same biological tissue or condition Y and involved in the biological function Z?”
     Information to answer such queries is available on the Internet, but no available software system is capable of computing the answer.
     The user has to search in different, often independent, resources.
  6. GPDW Data Warehouse
     • Several integrated databanks, including:
       - Entrez Gene, Ensembl
       - Homologene
       - IPI, UniProt/Swiss-Prot
       - Gene Ontology, GOA
       - BioCyc, KEGG, Reactome
       - InterPro, Pfam
       - OMIM, eVOC, …
     • Numerous integrated data, including:
       - 8,085,152 genes of 8,410 organisms
       - 31,347,655 proteins of 367,853 species
       - 33,252 Gene Ontology terms and 61,899 relations (is a, part of)
       - 27,667 biochemical pathways
       - 14,163 protein domains; 7,215 OMIM genetic disorders; …
     [Figure: on-line databanks (Entrez Gene, Homologene, IPI, eVOC, BioCyc, KEGG, Reactome, GOA, Gene Ontology) feed the Genomic and Proteomic Data Warehouse on a database server through automatic updating procedures]
  7. Search Computing project at PoliMi
     Search Computing (SeCo) aims at:
     1. Developing the informatics framework required for computing multi-topic searches by combining single-topic search results from search engines, which are often ranked, with other data and computational resources
        • directly supporting multi-topic ordered data
        • taking order into account when the results of several requests are combined
        • enabling exploration and expansion of search results
     2. Applying SeCo technology in different fields, including the Life Sciences => Bio-SeCo: support for answering complex bioinformatics queries
  8. Bio-SeCo: SeCo technologies to answer Life Science questions
     Life Science example query: “Which genes encode proteins in different organisms with high sequence similarity to a protein X and have some biomedical features in common, e.g. up/down significantly co-expressed in the same biological tissue or condition Y and involved in a biological function Z?”
     This multi-topic case study question can be decomposed into the following four single-topic sub-queries, each of which can be mapped to an available search service.
  9. Bio-SeCo: SeCo technologies to answer Life Science questions
     • “Which proteins in different organisms have high sequence similarity to a protein X?”
       → BLAST, a sequence similarity search program, in one of its many implementations, e.g. WU-BLAST or NCBI-Blast
     • “Which genes encode which proteins?”
       → GPDW (Genomic and Proteomic Data Warehouse), a query service to a database of genomic and proteomic data (GPDW_protein2gene)
  10. Bio-SeCo: SeCo technologies to answer Life Science questions
     • “Which genes are up/down significantly co-expressed in the same biological condition / tissue Y?”
       → ArrayExpress Gene Expression Atlas, a search engine of gene expression data
     • “Which genes are involved in a biological function Z?”
       → GPDW (Genomic and Proteomic Data Warehouse), a query service to a database of genomic and proteomic data (GPDW_gene2biologicalFunctionFeature)
  11. Bio-SeCo: SeCo technologies to answer Life Science questions
     The answer to each part of the question is integrated with the others, together with all the ranked results found.
     [Figure: query flow involving the services GPDW_protein2gene, BLAST, ArrayExpress and GPDW_gene2biologicalFunctionFeature]
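The integration of ranked partial answers can be illustrated with a small Python sketch. It is not the Search Computing join implementation: the function name `join_ranked`, the score dictionaries and the choice of multiplying scores are all assumptions made only for illustration.

```python
from typing import Dict, List, Tuple


def join_ranked(left: Dict[str, float], right: Dict[str, float]) -> List[Tuple[str, float]]:
    """Join two scored result sets on their shared IDs.

    Hypothetical combination rule: the joint score is the product of the
    two partial scores; ties are broken alphabetically by ID.
    """
    shared_ids = left.keys() & right.keys()
    combined = {key: left[key] * right[key] for key in shared_ids}
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Made-up scores standing in for two ranked service results.
similarity_scores = {"geneA": 0.9, "geneB": 0.8, "geneC": 0.4}
expression_scores = {"geneB": 0.5, "geneC": 0.9, "geneD": 0.7}

ranked = join_ranked(similarity_scores, expression_scores)
for gene_id, score in ranked:
    print(gene_id, round(score, 2))
```

Only the IDs returned by both services survive the join, and their order reflects both partial rankings rather than either one alone.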
  12. What I have done for my Minor Research project
  13. Semantic network: before
     [Figure: semantic network with the nodes Gene, Protein, Gene Expression and Biological Function Feature, linked by the relations Has, Is_encoded_by, Is_similar_to and Is_involved_in]
  14. Semantic network: now
     [Figure: the same semantic network extended with the nodes Genetic Disorder and Pathway, linked by further Is_involved_in relations, and with the new relations Is_functional_similar_to and Codes]
  15. Services I added: GPDW exploitation
     A Genetic Disorder is an illness caused by abnormalities in genes or chromosomes, especially a condition that is present from before birth.
     In biochemistry, Metabolic Pathways are series of chemical reactions occurring within a cell. In each pathway, a principal chemical is modified by a series of chemical reactions.
  16. Services I added: GPDW exploitation
     Which Genetic Disorders is the Gene X involved in?
     → GPDW (Genomic and Proteomic Data Warehouse), a query service to a database of genomic and proteomic data (GPDW_gene2geneticDisorder)
  17. Services I added: GPDW exploitation
     Which Genetic Disorders is the Protein Y involved in?
     → GPDW (Genomic and Proteomic Data Warehouse), a query service to a database of genomic and proteomic data (GPDW_protein2geneticDisorder)
  18. Services I added: GPDW exploitation
     Which Genes does the Genetic Disorder X involve?
     → GPDW (Genomic and Proteomic Data Warehouse), a query service to a database of genomic and proteomic data (GPDW_geneticDisorder2gene)
  19. Services I added: GPDW exploitation
     Which Proteins does the Genetic Disorder X involve?
     → GPDW (Genomic and Proteomic Data Warehouse), a query service to a database of genomic and proteomic data (GPDW_geneticDisorder2gene)
  20. Services I added: GPDW exploitation
     Same questions and GPDW services for Metabolic Pathways:
     • GPDW_gene2pathway
     • GPDW_protein2pathway
     • GPDW_pathway2gene
     • GPDW_pathway2protein
  21. Services I added: GPDW exploitation
     A Biological Function Feature is an item of information about a gene or a protein. It defines a certain peculiarity of a biomolecular entity, e.g. “is involved in lung cancer”.
     → GPDW_protein2biological_function_feature
  22. Services I added
     • These new services (Genetic Disorder and Pathway) are very useful and important, but they don’t take advantage of the main novelty provided by the Search Computing technology: the integration of ranked results
     • There’s no ranking on “being involved” in a Genetic Disorder or a Pathway…
  23. Services I added: Gene Semantic Similarity
     • The other service (SemSim) I integrated into Bio-SeCo computes the semantic similarity between a gene and the genes in a list (relation Is_functional_similar_to on Gene)
     • This service provides ranked results (given a gene X, it returns a list of genes ranked from the most semantically similar to X to the least semantically similar one)
     • SemSim takes advantage of the Search Computing capability of integrating ranked results
  24. Semantic Similarity?!? What does it mean?
     • Key point: given the gene X and the gene Y, how similar are they?
     • Semantically similar genes can be involved in similar activities, can be involved in similar pathways, and can have many annotations in common
     • To measure this similarity, I chose the Latent Semantic Indexing method, based on a matrix built with gene-related annotations
  25. Biomolecular annotation
     • The concept of annotation: association of nucleotide or amino acid sequences with useful information describing their features
     • This information is expressed through controlled vocabularies, sometimes structured as ontologies, where every controlled term of the vocabulary is associated with a unique alphanumeric code
     • The association of such a code with a gene or protein ID constitutes an annotation
     [Figure: an Annotation links a Gene / Protein to a Biological function feature (gene2bff)]
  26. Biomolecular annotation
     • The association of an item of information/feature with a gene or protein ID constitutes an annotation
     • Annotation example:
       - gene: GD4
       - feature: “is present in the mitochondrial membrane”
     [Figure: an Annotation links a Gene / Protein to a Biological function feature (gene2bff)]
  27. Latent Semantic Indexing: Singular Value Decomposition (SVD)
     • Annotation matrix $A \in \{0, 1\}^{m \times n}$
       - m rows: genes / proteins
       - n columns: annotation terms
     • A(i,j) = 1 if gene / protein i is annotated to term j or to any descendant of j in the considered ontology structure (true path rule)
     • A(i,j) = 0 otherwise (it is unknown)

               term01  term02  term03  term04  …  termN
       gene01    0       0       0       0     …    0
       gene02    0       1       1       0     …    1
       …         …       …       …       …     …    …
       geneM     0       0       0       0     …    0

  28. Latent Semantic Indexing: Singular Value Decomposition (SVD)
     • Annotation matrix $A \in \{0, 1\}^{m \times n}$, with rows, columns and entries defined as in slide 27
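As a minimal sketch of how such an annotation matrix could be filled under the true path rule, the following Python code propagates each direct annotation to all ancestor terms. The function names, the `parents` map describing the ontology and the toy term names are assumptions for illustration, not part of the GPDW software.

```python
from typing import Dict, List, Set


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every ancestor of `term` in the ontology (child -> parents map)."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def build_annotation_matrix(
    genes: List[str],
    terms: List[str],
    direct: Dict[str, Set[str]],
    parents: Dict[str, List[str]],
) -> List[List[int]]:
    """Build A with A[i][j] = 1 if gene i is annotated to term j or to a descendant of j."""
    column = {term: j for j, term in enumerate(terms)}
    matrix = [[0] * len(terms) for _ in genes]
    for i, gene in enumerate(genes):
        for term in direct.get(gene, set()):
            # True path rule: a direct annotation also holds for all ancestors.
            for implied in {term} | ancestors(term, parents):
                if implied in column:
                    matrix[i][column[implied]] = 1
    return matrix


# Toy ontology and annotations (hypothetical names).
terms = ["cell_part", "membrane", "mitochondrial_membrane"]
parents = {"mitochondrial_membrane": ["membrane"], "membrane": ["cell_part"]}
genes = ["gene01", "gene02"]
direct = {"gene02": {"mitochondrial_membrane"}}

A = build_annotation_matrix(genes, terms, direct, parents)
print(A)
```

A gene with no known annotation keeps an all-zero row, which matches the reading of 0 as "unknown" rather than "false".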
  29. Latent Semantic Indexing: Singular Value Decomposition (SVD)
     Compute SVD: $A = U \Sigma V^T$
     Compute reduced rank approximation: $A_k = U_k \Sigma_k V_k^T$
     • An annotation prediction is performed by computing a reduced rank approximation $A_k$ of the annotation matrix $A$ (where $0 < k < r$, with $r$ the number of non-zero singular values of $A$, i.e. the rank of $A$)
  30. Latent Semantic Indexing: Singular Value Decomposition (SVD)
     Compute reduced rank approximation: $A_k = U_k \Sigma_k V_k^T$
     • $A$: genes-features matrix
     • $U_k$: gene vectors matrix
     • $\Sigma_k$: singular value matrix
     • $V_k^T$: feature vectors matrix
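For reference, the two decompositions can be written out with their dimensions. The block below assumes the compact form of the SVD (only the $r$ non-zero singular values are kept), which the slides do not state explicitly.

```latex
% Compact SVD of the m x n annotation matrix A of rank r (assumed form)
\[
  A = U \Sigma V^{T}, \qquad
  U \in \mathbb{R}^{m \times r},\quad
  \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),\quad
  V \in \mathbb{R}^{n \times r},\quad
  \sigma_1 \ge \dots \ge \sigma_r > 0
\]
% Rank-k truncation, 0 < k < r: keep the k largest singular values
\[
  A_k = U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}, \qquad
  U_k \in \mathbb{R}^{m \times k},\quad
  \Sigma_k \in \mathbb{R}^{k \times k},\quad
  V_k \in \mathbb{R}^{n \times k}
\]
```

Here $u_i$ and $v_i$ are the $i$-th columns of $U$ and $V$. By the Eckart-Young theorem, $A_k$ is the closest matrix of rank at most $k$ to $A$ in the Frobenius norm, which is why the truncation can be read as a smoothed version of the annotation data.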
  31. Latent Semantic Indexing: Singular Value Decomposition (SVD)
     • $U_k$: gene vectors matrix
     • $\Sigma_k$: singular value matrix
     • $V_k^T$: feature vectors matrix
     • These matrices can be used to measure the distances between objects (genes or features) in the k-dimensional space.
     • For example, it is possible to compute the distance between two gene vectors to understand their similarity level. The same can be done for features.
  32. Latent Semantic Indexing: Singular Value Decomposition (SVD)
     • $U_k$: gene vectors matrix
     • $\Sigma_k$: singular value matrix
     • $V_k^T$: feature vectors matrix
     • For our implementation of LSI, we chose the cosine similarity as the measure of the semantic similarity between genes.
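The cosine-based ranking can be sketched in a few lines of Python. This is not the SemSim code: the function names and the toy vectors are assumptions, and the sketch takes the k-dimensional gene vectors as given, since the slides do not say whether the rows of $U_k$ are scaled by $\Sigma_k$ before comparison.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query: str, vectors: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank all other genes from most to least similar to the query gene."""
    scores = [
        (gene, cosine_similarity(vectors[query], vector))
        for gene, vector in vectors.items()
        if gene != query
    ]
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Toy 2-dimensional gene vectors (hypothetical values).
gene_vectors = {
    "geneX": [1.0, 0.0],
    "geneA": [2.0, 0.0],
    "geneB": [1.0, 1.0],
    "geneC": [0.0, 3.0],
}

ranking = rank_by_similarity("geneX", gene_vectors)
for gene, score in ranking:
    print(gene, round(score, 3))
```

Cosine similarity ignores vector length, so a gene whose vector points in the same direction as the query ranks first regardless of magnitude.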
  33. Minor Research Project
     • A preprocessing program computes the Singular Value Decomposition (SVD)
     • It writes the matrices ($U_k$, $\Sigma_k$, $V_k^T$) to three different files
     • These files are placed in the data directory of the SemSim REST web application
     • SemSim (JSP + Java) computes the Latent Semantic Indexing (LSI) measures and returns the ranked list of genes
  34. Minor Research Project
     • Developed with REST technology
     • Integrated into Bio-SeCo as an external service, with a wrapper
     • Input: gene (ID, name, taxonomy)
  35. Minor Research Project
     • Output: list of genes ranked by their semantic similarity to the input gene
  36. Minor Research Project
     • It is now possible to answer many other biological questions. For example:
       Among the proteins that are encoded by genes, in the Chicken organism, with higher functional semantic similarity to gene X, which are those with higher sequence similarity to protein Y?
     [Figure: query flow from Input to Output through the SemSim, ProteinByGene and Sequence Alignment services]
  37. Minor Research Project
     • It is now possible to answer many other biological questions that involve Gene Semantic Similarity computation, Genetic Disorders or Metabolic Pathways. For example:
       Among the proteins that are encoded by genes, in the Chicken organism, with higher functional semantic similarity to gene X, which are those with higher sequence similarity to protein Y?
     [Figure: query flow from Input to Output through the SemSim, ProteinByGene and Sequence Alignment services]
     DEMO
  38. Thanks for your attention
