Unknown Genes, Community Profiling, & Biotorrents.net

Published in: Technology

  1. Unknown Genes, Community Profiling, & Biotorrents.net
     Morgan Langille
     UC Davis
  2. Genes with unknown function
  3. Questions
     • If we wanted to start studying a gene of unknown function, which one(s) should we study first?
     • How many un-annotated genes could be annotated?
     • What proportion of unknown genes (hypothetical proteins) are probably not real proteins (i.e. pseudogenes, mis-predicted ORFs, etc.)?
     • What proportion of unknown gene families are probably phage-related?
     • Can some of these families (hopefully the top-ranking ones) be characterized using non-similarity-based bioinformatic approaches?
  4. Outline of project
  5. Community Profiling
  6. Phylogenetic profiling
     • Wu, et al., PLOS Genetics, 2005
     • C. hydrogenoformans: identified presence or absence of homologs in all other completely sequenced genomes
     • Identified many hypothetical proteins that had the same profile as other sporulation proteins
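
The idea on this slide can be sketched in a few lines: record each gene's presence (1) or absence (0) across a set of genomes, then group genes whose patterns are identical. This is a minimal illustration, not the method of Wu et al.; the gene names and profiles are made-up toy data, and `group_by_profile` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_profile(presence):
    """Group genes whose presence/absence pattern across genomes is identical."""
    groups = defaultdict(list)
    for gene, profile in presence.items():
        groups[tuple(profile)].append(gene)
    # Keep only patterns shared by at least two genes.
    return [sorted(genes) for genes in groups.values() if len(genes) > 1]


# Hypothetical toy data: one column per genome, 1 = homolog present, 0 = absent.
presence = {
    "spo_known_A": [1, 0, 1, 1, 0],
    "spo_known_B": [1, 0, 1, 1, 0],
    "hypothetical_1": [1, 0, 1, 1, 0],
    "hypothetical_2": [0, 1, 0, 1, 1],
}

shared = group_by_profile(presence)
print(shared)
```

In this toy case `hypothetical_1` lands in the same group as the two known sporulation genes, which is the kind of signal the slide describes.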
  7. Community Profiling
     • KEGG
     • COG
     • DeLong, et al., Science, 2006
  8. Community Profiling
     • Look across multiple metagenomic samples
     • Gene families that have similar profiles may have similar function
     • Similar to using co-expression to identify similarly functioning genes
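
One way to make "similar profiles" concrete is a correlation between the abundance vectors of two gene families across samples. The sketch below uses Pearson correlation as an assumption; the slides do not say which similarity measure was used, and the family names and counts are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length abundance profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-variation signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts of each gene family in four metagenomic samples.
profiles = {
    "family_A": [10, 2, 30, 4],
    "family_B": [20, 4, 60, 8],  # rises and falls with family_A
    "family_C": [4, 30, 2, 10],  # opposite trend
}

r_ab = pearson(profiles["family_A"], profiles["family_B"])
r_ac = pearson(profiles["family_A"], profiles["family_C"])
print(round(r_ab, 3), round(r_ac, 3))
```

Families A and B would be candidates for a shared function under this measure, while A and C would not.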
  9. So what have I done?
     • "all metagenomics peptides" from CAMERA
     • 43M sequences (mostly GOS)
     • Searched against 11,000 Pfams using HMMER 3
     • Used “cluster” to group genes and samples
  10. Results
      • Figure labels: Metagenomic Samples; Pfams
      • Red = above avg. number of Pfams
      • Green = below avg. number of Pfams
      • Have not normalized for:
        • Number of sequences per sample
        • Number of Pfams
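
The slide notes that the counts were not yet normalized. A minimal sketch of one possible normalization, dividing each Pfam count by the number of sequences in its sample, is shown below. This is an assumption about what such a step could look like, not something the author reports doing; the Pfam labels and numbers are hypothetical.

```python
def normalize_by_sample_size(counts, sequences_per_sample):
    """Convert raw Pfam hit counts into per-sequence rates for each sample."""
    normalized = {}
    for pfam, row in counts.items():
        normalized[pfam] = [c / n for c, n in zip(row, sequences_per_sample)]
    return normalized


# Hypothetical raw hit counts: one column per metagenomic sample.
counts = {"PF_a": [50, 5], "PF_b": [10, 10]}
sequences_per_sample = [1000, 100]  # the first sample is ten times larger

normalized = normalize_by_sample_size(counts, sequences_per_sample)
print(normalized)
```

After this step `PF_a` has the same rate in both samples even though its raw counts differ tenfold, so differences in sample size no longer look like differences in abundance.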
  11. Example of phage Pfams clustering together
  12. Measuring functional relatedness
      • Need to measure community profiling performance
      • The hierarchical clusters were broken into 575 groups using a correlation cutoff of 0.90 or above.
      • Pfams were mapped to GO terms using pfam2GO
        • 1893 Pfams had no associated GO term
        • 695 of these were Domains of Unknown Function (DUFs)
        • 3377 Pfams had one or more associated GO terms and could be used for further analysis
      • Only 67 (of 575) clusters contained 4 or more Pfams with at least one GO term
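
The two filtering steps on this slide can be sketched as follows: group Pfams whose pairwise correlation is at least 0.90, then keep only groups with 4 or more GO-annotated members. The grouping here is a simplification (single linkage via union-find), not the algorithm of the “cluster” program, and the Pfam names, correlations, and GO labels are hypothetical placeholders.

```python
def group_by_cutoff(names, correlation, cutoff=0.90):
    """Merge items linked by a pairwise correlation >= cutoff (single linkage)."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for (a, b), r in correlation.items():
        if r >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return [sorted(g) for g in groups.values()]


def usable_clusters(groups, pfam_to_go, min_annotated=4):
    """Keep groups with at least `min_annotated` Pfams that have a GO term."""
    return [
        g for g in groups
        if sum(1 for p in g if pfam_to_go.get(p)) >= min_annotated
    ]


# Hypothetical toy input.
names = ["PF1", "PF2", "PF3", "PF4", "PF5", "PF6"]
correlation = {
    ("PF1", "PF2"): 0.95, ("PF2", "PF3"): 0.92, ("PF3", "PF4"): 0.91,
    ("PF4", "PF5"): 0.97, ("PF5", "PF6"): 0.30,
}
pfam_to_go = {  # PF5 has no GO term, like a DUF
    "PF1": ["go_a"], "PF2": ["go_b"], "PF3": ["go_c"],
    "PF4": ["go_d"], "PF6": ["go_e"],
}

groups = group_by_cutoff(names, correlation)
usable = usable_clusters(groups, pfam_to_go)
print(groups, usable)
```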
  13. Measuring GO similarity
      • G-SESAME
        • Measures the semantic similarity of any two GO terms
        • Not downloadable, so queries had to be made to their web server (not fun)
      • Pairwise similarity was measured for each pair of GO terms in each cluster
        • Had to check if terms were in the same namespace
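
The per-cluster scoring described here can be sketched as an average over all term pairs that share a namespace. The `similarity` argument stands in for a G-SESAME query; the lookup table below is a hypothetical stub with invented scores, since the real values come from the G-SESAME web server.

```python
from itertools import combinations


def average_cluster_similarity(terms, namespace_of, similarity):
    """Average pairwise similarity over term pairs in the same GO namespace."""
    scores = [
        similarity(a, b)
        for a, b in combinations(terms, 2)
        if namespace_of[a] == namespace_of[b]
    ]
    return sum(scores) / len(scores) if scores else None


# Hypothetical terms and namespaces for one cluster.
terms = ["t1", "t2", "t3", "t4"]
namespace_of = {
    "t1": "molecular_function",
    "t2": "molecular_function",
    "t3": "molecular_function",
    "t4": "biological_process",  # never compared with t1-t3
}

# Stub for the external similarity service; scores are made up.
stub_scores = {
    frozenset(("t1", "t2")): 0.8,
    frozenset(("t1", "t3")): 0.4,
    frozenset(("t2", "t3")): 0.6,
}


def stub_similarity(a, b):
    return stub_scores[frozenset((a, b))]


cluster_score = average_cluster_similarity(terms, namespace_of, stub_similarity)
print(cluster_score)
```

Pairs that cross namespaces are skipped rather than scored as zero, so a cluster mixing namespaces is not penalized for comparisons that would be meaningless.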
  14. Results
      • Average G-SESAME scores were calculated for each cluster
      • The average of all cluster averages was 0.484
      • 10 clusters had a score of 0.60 or greater.
      • The data were then randomized by assigning the same GO terms to different random clusters, which gave an average score of 0.412-0.420 over 4 iterations
      • Each of the 4 iterations had only 1 or 0 clusters with a score of 0.60 or greater
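
The randomization control can be sketched as pooling every GO term, shuffling, and dealing the terms back out into clusters of the original sizes. This is one plausible reading of "the same GO terms but in different random clusters"; the exact procedure is not given in the slides, and the term labels are hypothetical.

```python
import random


def shuffle_into_clusters(clusters, rng):
    """Reassign all terms at random while preserving each cluster's size."""
    pooled = [term for cluster in clusters for term in cluster]
    rng.shuffle(pooled)
    shuffled, start = [], 0
    for cluster in clusters:
        shuffled.append(pooled[start:start + len(cluster)])
        start += len(cluster)
    return shuffled


# Hypothetical clusters of GO term labels.
clusters = [["t1", "t2", "t3", "t4"], ["t5", "t6"], ["t7", "t8", "t9"]]

rng = random.Random(0)  # fixed seed so a run can be repeated
iterations = [shuffle_into_clusters(clusters, rng) for _ in range(4)]
```

Each shuffled set would then be scored the same way as the real clusters, and the real average compared against the spread of the shuffled averages.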
  15. Community Profiling Results
      • Average of all clusters = 0.49
      • 10 clusters are > 0.60
  16. Random Results
      • Average of all clusters (4 iterations) = 0.41 - 0.42
      • 1 or 0 clusters are > 0.60
  17. BioTorrents
  18. BitTorrent
      • A peer-to-peer file sharing protocol
      • ~ 27-55% of all Internet traffic
      • Mostly illegal file sharing
      • Files are shared in small pieces between several users
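
To illustrate "small pieces": in the original BitTorrent protocol a file is cut into fixed-length pieces, and the torrent metadata stores a SHA-1 hash of each piece so a downloader can verify pieces received from different peers. The sketch below shows only that hashing step, with a tiny made-up payload and piece length; real torrents use far larger pieces.

```python
import hashlib


def piece_hashes(data, piece_length):
    """Split data into fixed-length pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


# Hypothetical payload standing in for a large dataset.
data = b"ACGTACGTAC"
hashes = piece_hashes(data, piece_length=4)  # pieces: ACGT, ACGT, AC

print(len(hashes), [h.hex()[:8] for h in hashes])
```

Because every piece is verified independently, pieces can arrive in any order and from any peer, which is what lets many users share the load of a large download.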
  19. Torrents for Biology
      • Why use torrent technology?
        • Download large datasets much faster
        • Searchable central listing
        • Decentralization of data
  20. What is BioTorrents?
      • A legal file sharing website for scientists
      • Users can upload their own research results, data, software
      • Users can browse or search through all datasets
      • Data is not hosted on BioTorrents
  21. www.biotorrents.net
  22. Browse & Search
  23. Details
  24. Sign Up
  25. Upload
  26. Other Features
      • Forum
      • RSS Feed
      • Top 10
      • FAQ
      • Links
  27. Who will upload data?
      • Everyone!
      • Realistically,
        • Large organizations (e.g. NCBI, CAMERA, etc.)
          • May need some convincing to host their data via torrents in addition to FTP, HTTP, etc.
        • Scientists who really support open science
          • Sharing data before it is formally complete and published
  28. Technical Challenges
      • Many institutions frown on BitTorrent technology
        • A port must be opened/forwarded
        • Client program and computer must be left running
      • Ensuring data is legal, virus free, etc.
        • Users that upload many legitimate torrents will provide more confidence to people downloading
      • Making downloading and uploading easy
