Harvester I


Published in: Technology
  • Harvester I

    1. 1. Bioinformatic Harvester I. Education and Training, Academia Sinica Institute of Biomedical Sciences, Biomedical IT Core. Ming-Fang Tsai [email_address]
    2. 2. Agenda 1. Introduction 2. How does Harvester work? 3. How to Query in Harvester? 4. Summary
    3. 3. Introduction
    4. 4. Introduction <ul><li>Databases of gene- and protein-related information keep multiplying. </li></ul><ul><ul><li>Finding useful information across these different databases requires integrating and comparing them. </li></ul></ul><ul><li>Different databases are operated in different ways. </li></ul><ul><ul><li>Genes and proteins whose data are still incomplete take extra effort, and the search may still turn up uncertain data. </li></ul></ul><ul><li>The same gene or protein may have different data on different websites. </li></ul><ul><ul><li>This noise has to be filtered out. </li></ul></ul><ul><li>Harvester can solve the problems above. </li></ul>
    5. 5. About Harvester <ul><li>Harvester is a software tool developed by the European Molecular Biology Laboratory (EMBL), Heidelberg, Germany. </li></ul><ul><li>A bioinformatics meta search engine for gene- and protein-associated information. </li></ul>A meta search is like querying Yahoo and Google at the same time.
    6. 6. About Harvester <ul><li>Cross-links 16 major bioinformatics resources and allows cross searches. </li></ul><ul><li>Sorts the search results and displays the most relevant information. </li></ul><ul><li>Caches the results locally and allows convenient, quick access to them. </li></ul>
    7. 7. The Interface <ul><li>http://harvester.embl.de/ </li></ul>
    8. 8. About Harvester: Harvester is a search engine for gene and protein information. It cross-links bioinformatics resources and allows searching in three genomes. http://harvester.embl.de/
    9. 9. Exercise <ul><li>http://harvester.embl.de/ </li></ul><ul><li>Get to know the system interface </li></ul>
    10. 10. Data super-integration tool <ul><li>Bioinformatic Harvester provides information from already "integrative" databases (such as SOURCE, Ensembl, UCSC, and NCBI Entrez) in one single webpage. </li></ul>
    11. 11. Databases linked by Harvester: RZPD, iHOP, STRING, Entrez, HomoloGene, IPI, GFP-cDNA, PSORT, UniProt, SMART, SOURCE, SOSUI, OMIM, Genome-Browser, NCBI BLAST, Ensembl
    12. 12. Cross search
    13. 13. Information Retrieval <ul><li>With Harvester, researchers can quickly and easily search for: </li></ul><ul><ul><li>links to diseases </li></ul></ul><ul><ul><li>protein domains and homologies </li></ul></ul><ul><ul><li>a summary of protein function and the latest literature information </li></ul></ul><ul><ul><li>predicted protein localization </li></ul></ul><ul><ul><li>experimentally verified protein localization (GFP-cDNA project) </li></ul></ul><ul><ul><li>all known protein sequences </li></ul></ul><ul><ul><li>gene synonyms and most database identifiers </li></ul></ul>
    14. 14. Exercise <ul><li>http://harvester.embl.de/ </li></ul><ul><li>Enter a keyword </li></ul>
    15. 15. How Does Harvester Work?
    16. 16. How does Harvester work? <ul><li>Collects information from gene and protein databases along with prediction servers. </li></ul><ul><li>The search index is based on UniProt. </li></ul><ul><li>Currently offers information for 80,000 human, 72,000 mouse, and 20,000 rat proteins. </li></ul><ul><li>Text-based information: UniProt, SOURCE, SMART, SOSUI, PSORT, RZPD, HomoloGene, GFP-cDNA, IPI </li></ul><ul><li>Graphics-rich information: NCBI BLAST, Genome-Browser, Ensembl, RZPD, CDART, STRING (databases rich in graphical elements are not collected, but cross-linked via iframes) </li></ul>
    17. 17. Text Information <ul><li>For the text results, Bioinformatic Harvester provides a special ranking system, similar to Google PageRank, to sort the search results and present the most relevant information first. </li></ul>
    18. 18. Graphical Information <ul><li>Bioinformatic Harvester displays database hits that are rich in graphical information as "iframes". </li></ul><ul><li>The "iframes" provide the user with the latest information from the original database server. </li></ul>
    19. 19. How does Harvester work? <ul><li>Data from PSORT and SMART is pre-computed, collected, and indexed. (Pages older than 21 days are continuously updated.) </li></ul><ul><li>Text-based data is optimally indexed; redundant information is removed by the converter modules. </li></ul><ul><li>Predictions that are identical on two or more servers are scored higher than predictions returned by only one server. </li></ul>
    20. 20. Database Identifier Cross-links <ul><li>Every Harvester page contains database identifiers from various sources (e.g., SIRT2). </li></ul>
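The rule that predictions agreeing across servers score higher can be sketched as a simple vote count. This is only an illustration: the slides do not give Harvester's actual weighting, so the scoring (one point per agreeing server), the function name `consensus_scores`, and the server names are all assumptions.

```python
from collections import Counter


def consensus_scores(predictions):
    """Rank predicted values by how many servers returned them.

    `predictions` maps a (hypothetical) server name to its predicted value.
    Assumption: each agreeing server adds one point to the score.
    """
    counts = Counter(predictions.values())
    # Highest support first; ties are broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical localization predictions from three servers.
example = {
    "server_a": "nucleus",
    "server_b": "nucleus",
    "server_c": "cytoplasm",
}
ranked = consensus_scores(example)
print(ranked)
```

A prediction returned by two servers ranks ahead of one returned by a single server, which is the behaviour the slide describes.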
    21. 21. Database Identifier Cross-links <ul><li>Harvester ID converter: http://www-db.embl.de/jss/servlet/de.embl.bk.harvester.IPIMapper </li></ul><ul><li>SIRT2: UniProt Q92830; Ensembl ENSG00000068903; RefSeq NM_012237; InterPro IPR003000; UniGene Cluster Hs.466693 </li></ul>
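One way to picture identifier cross-linking is a reverse index from each identifier back to its gene symbol. The sketch below copies the SIRT2 identifiers exactly as listed on the slide; the dictionary layout and the names `CROSS_LINKS` and `build_reverse_index` are assumptions for illustration and do not describe how the Harvester ID converter is implemented.

```python
# Identifiers for SIRT2 as listed on the slide; the layout is an assumption.
CROSS_LINKS = {
    "SIRT2": {
        "UniProt": "Q92830",
        "Ensembl": "ENSG00000068903",
        "RefSeq": "NM_012237",
        "InterPro": "IPR003000",
        "UniGene": "Hs.466693",
    },
}


def build_reverse_index(cross_links):
    """Map every identifier to (gene symbol, source database)."""
    reverse = {}
    for symbol, ids in cross_links.items():
        for source, identifier in ids.items():
            reverse[identifier] = (symbol, source)
    return reverse


REVERSE_INDEX = build_reverse_index(CROSS_LINKS)
print(REVERSE_INDEX["NM_012237"])
```

With such an index, any one identifier is enough to reach the gene symbol and, from there, all the other identifiers.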
    22. 22. How to Query in Harvester?
    23. 23. Three genomes are available for searching: human, mouse, and rat.
    24. 24. The Ways to Query <ul><li>You may use (single) sequences or (single) gene names, GenBank accessions, protein domains ("SH3"), protein motifs ("SEQ:KDEL"), protein localization ("endoplasmic reticulum"), literature, or authors ("Straussberg"). </li></ul><ul><li>You may also combine searches simply by entering multiple words (enable the "AND" checkbox). </li></ul><ul><li>The default is "OR". </li></ul>
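The difference between the default "OR" and the "AND" checkbox can be sketched with a small full-text matcher. This is a minimal sketch, not Harvester's search code: the function names, the whole-word matching, and the three example pages are all assumptions.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of words."""
    return set(re.findall(r"\w+", text.lower()))


def search(pages, query, require_all=False):
    """Return names of pages matching the query words.

    require_all=False mimics the default "OR"; True mimics the "AND" checkbox.
    """
    words = tokenize(query)
    if not words:
        return []
    combine = all if require_all else any
    return sorted(
        name
        for name, text in pages.items()
        if combine(word in tokenize(text) for word in words)
    )


# Hypothetical page texts, for illustration only.
pages = {
    "p1": "BRCA1 binds DNA in the nucleus",
    "p2": "CDC42 localizes to the plasma membrane",
    "p3": "BRCA1 interacts with BAP1 at the membrane",
}
print(search(pages, "brca1 membrane"))                    # OR search
print(search(pages, "brca1 membrane", require_all=True))  # AND search
```

With "OR", every page containing at least one query word is a hit; with "AND", only pages containing all of them remain.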
    25. 25. Queries (type: example query) <ul><li>gene name and aliases: brca1 </li></ul><ul><li>molecular function: DNA binding </li></ul><ul><li>chromosomal location: 17q21 </li></ul><ul><li>protein domains: the ring-type zinc finger domain interacts with bap1 </li></ul><ul><li>localization: plasma membrane </li></ul><ul><li>disease-related information: occurs in hemizygous males </li></ul><ul><li>disease-related information: brain malformation </li></ul><ul><li>paper title: Cloning and expression of a human CDC42 </li></ul><ul><li>author: Barfold </li></ul><ul><li>automatic–manual annotation: highly expressed in heart </li></ul>
    26. 26. The limitation <ul><li>It is not possible to restrict the search to specific fields; the search is always performed as "full text". </li></ul>
    27. 27. Exercise <ul><li>http://harvester.embl.de/ </li></ul><ul><li>AND, OR </li></ul>
    28. 29. The First Output <ul><li>The first output is a list of hits retrieved from the database using one of the described options. </li></ul><ul><li>Note that there is a field "maximum shown hits" (ranging from 25 to 10) </li></ul>
    29. 30. The Second Output <ul><li>The second output is generated when one of the hits in the list is opened. </li></ul><ul><li>The results from all the different databases for one single gene or protein are then displayed in one single HTML page. </li></ul><ul><li>The ranking is similar to Google "PageRank". </li></ul>
    30. 31. The Second Output <ul><li>displays the sum of all word scores </li></ul><ul><li>C: number of different search words on the page </li></ul><ul><li>S: total number of word hits on the page </li></ul><ul><li>a short feature summary of the protein </li></ul>
    31. 32. Score <ul><li>Score: the number of occurrences of the word on this page divided by the total number of occurrences on all pages in the index, multiplied by 100000 (for readability) </li></ul>
    32. 33. Score <ul><li>Example: (113/2626)*100000 = 4303 </li></ul>
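The score definition and the worked example above can be checked with a few lines of Python. The function names and the second query word ("gtpase", with made-up counts) are assumptions added for illustration; only the 113 and 2626 figures come from the slide. The page-level values follow the slide's description: the displayed score is the sum of all word scores, C counts the different search words found on the page, and S is the total number of word hits.

```python
def word_score(count_on_page, count_in_index):
    """Occurrences on this page / occurrences in the whole index * 100000."""
    if count_in_index <= 0:
        raise ValueError("count_in_index must be positive")
    return count_on_page / count_in_index * 100000


def page_summary(page_counts, index_counts):
    """Return the summed score, C, and S for one page."""
    found = {word: n for word, n in page_counts.items() if n > 0}
    return {
        "score": sum(word_score(n, index_counts[w]) for w, n in found.items()),
        "C": len(found),           # different search words on the page
        "S": sum(found.values()),  # total word hits on the page
    }


# Slide example: 113 occurrences on the page, 2626 in the whole index.
print(int(word_score(113, 2626)))  # 4303

# "gtpase" and its counts are hypothetical.
summary = page_summary({"cdc42": 113, "gtpase": 0}, {"cdc42": 2626, "gtpase": 500})
print(summary["C"], summary["S"])
```

Dividing by the index-wide count means a word that is rare in the index but frequent on one page contributes a high score to that page.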
    33. 34. Excluding Search <ul><li>"AND" can be combined with "NOT" </li></ul>
    34. 35. Sort Result: NOT, AND, Cluster
    35. 36. Exercise <ul><li>http://harvester.embl.de/ </li></ul><ul><li>After entering Cdc42 Cdc42GAP, how do you obtain the following result? </li></ul>
    36. 37. Assemble the information on a single HTML page (figure labels: amino acid sequence; Link out)
    37. 38. Link Out <ul><li>IPI (International Protein Index): assembled from protein sequence information taken from data sources such as UniProt, RefSeq, and Ensembl. </li></ul><ul><li>RZPD: German Resource Center for Genome Research in Berlin/Heidelberg. </li></ul><ul><li>OMIM </li></ul><ul><li>Entrez (NCBI) </li></ul><ul><li>Google Scholar: a freely accessible web search engine that indexes the full text of literature </li></ul><ul><li>iHOP (Information Hyperlinked over Proteins): an online service that provides a gene-guided network to access PubMed abstracts. </li></ul><ul><li>GoPubMed: ontology-based literature search </li></ul><ul><li>H-Inv: an integrated database of human genes and transcripts </li></ul><ul><li>MitoCheck: an integrated research project to study the regulation of mitosis in human cells. </li></ul>
    38. 39. Cross search
    39. 40. Results can be changed within the same page
    40. 41. Going back and forward between pages <ul><li>The "Back" and "Forward" buttons work for these "in-page windows". </li></ul>
    41. 42. DEMO <ul><li>http://harvester.embl.de/ </li></ul>
    42. 43. Summary
    43. 44. Summary <ul><li>Multiple ways to query "Harvester" </li></ul><ul><li>Link out to different kinds of databases, such as gene-centered, sequence, and protein databases </li></ul><ul><li>All web links on "Harvester" pages can be saved locally </li></ul><ul><li>The ranking system works like Google "PageRank". </li></ul><ul><li>A human-edited feedback platform </li></ul><ul><ul><li>Harvester Wiki (Hwiki) </li></ul></ul>
    44. 45. Next Topic <ul><li>Advanced search </li></ul><ul><ul><li>sequence search </li></ul></ul><ul><ul><li>search by application </li></ul></ul><ul><li>Database </li></ul><ul><ul><li>Gene: SOURCE, Ensembl, UCSC Genome Browser, NCBI Entrez, UniProt. </li></ul></ul><ul><ul><li>Sequence homology: NCBI BLAST, HomoloGene, BLAST Link. </li></ul></ul><ul><ul><li>Protein domains: SMART, CDART. </li></ul></ul><ul><ul><li>Protein trans-membrane prediction: SOSUI. </li></ul></ul><ul><ul><li>Protein localization prediction: PSORT2. </li></ul></ul><ul><ul><li>Protein localization image data: GFP-cDNA. </li></ul></ul><ul><ul><li>Gene-disease association: OMIM. </li></ul></ul><ul><ul><li>Sequence clone distribution: RZPD. </li></ul></ul><ul><ul><li>Protein-protein interactions: STRING, iHOP. </li></ul></ul>
    45. 46. Thank you!