Unknown Genes, Community Profiling, &
Published in: Technology

  • 1. Unknown Genes, Community Profiling, &
    Morgan Langille
    UC Davis
  • 2. Genes with unknown function
  • 3. Questions
    If we wanted to start studying a gene of unknown function, which one(s) should we study first?
    How many un-annotated genes could be annotated?
    What proportion of unknown genes (hypothetical proteins) are probably not real proteins (e.g. pseudogenes, mis-predicted ORFs, etc.)?
    What proportion of unknown gene families are probably phage-related?
    Can some of these families (hopefully the top ranking ones) be characterized using non-similarity based bioinformatic approaches?
  • 4. Outline of project
  • 5. Community Profiling
  • 6. Phylogenetic profiling
    Wu, et al., PLOS Genetics, 2005
    For C. hydrogenoformans, identified presence or absence of homologs in all other completely sequenced genomes
    Identified many hypothetical proteins that had the same profile as other sporulation proteins
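A minimal sketch of the phylogenetic profiling idea on this slide: genes whose presence/absence pattern across genomes is identical are grouped together. The gene names and the five-genome profiles are hypothetical, and this is not the pipeline used by Wu et al.

```python
from collections import defaultdict

def group_by_profile(profiles):
    """Group genes whose presence/absence profiles are identical."""
    groups = defaultdict(list)
    for gene, profile in profiles.items():
        groups[tuple(profile)].append(gene)
    # Keep only profiles shared by at least two genes.
    return {p: sorted(g) for p, g in groups.items() if len(g) > 1}

# Hypothetical data: 1 = homolog present in that genome, 0 = absent.
profiles = {
    "spo_known_1": [1, 0, 1, 1, 0],
    "hypothetical_A": [1, 0, 1, 1, 0],
    "hypothetical_B": [0, 1, 0, 0, 1],
    "housekeeping_1": [1, 1, 1, 1, 1],
}
shared = group_by_profile(profiles)
```

Here `hypothetical_A` ends up in the same group as the known sporulation gene, which is the kind of hint the slide describes.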
  • 7. Community Profiling
    Delong, et al., Science, 2006
  • 8. Community Profiling
    Look across multiple metagenomic samples
    Gene families that have similar profiles may have similar function
    Similar to using co-expression to identify genes with similar functions
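The comparison of profiles across metagenomic samples could be sketched as a Pearson correlation between the per-sample counts of two gene families. The family names and counts below are hypothetical, and Pearson correlation is an assumption here; the slides do not say which similarity measure was used.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length count profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / (sd_x * sd_y)

# Hypothetical counts of each family in four metagenomic samples.
counts = {
    "family_A": [10, 0, 5, 20],
    "family_B": [20, 0, 10, 40],
    "family_C": [0, 8, 3, 1],
}
r_ab = pearson(counts["family_A"], counts["family_B"])
r_ac = pearson(counts["family_A"], counts["family_C"])
```

Families A and B rise and fall together across samples, so they would be candidates for a shared function; A and C would not.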
  • 9. So what have I done?
    "all metagenomics peptides" from CAMERA
    43M sequences (mostly GOS)
    Searched against 11,000 Pfams using HMMER 3
    Used “cluster” to group genes and samples
  • 10. Results
    Metagenomic Samples
    Red = above-average number of Pfams
    Green = below-average number of Pfams
    Have not normalized
    For number of sequences per sample
    For number of Pfams
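The slide notes that the counts were not normalized. A rough sketch of one possible normalization, dividing each count by the number of sequences in its sample before marking values as above or below average, is shown here. The function names, counts, and sample sizes are all hypothetical.

```python
def normalize_by_sample_size(counts, sample_sizes):
    """Convert raw hit counts into hits per sequence for each sample."""
    return {
        family: [c / n for c, n in zip(row, sample_sizes)]
        for family, row in counts.items()
    }

def above_average(row):
    """True where a value exceeds the row mean (the 'red' cells)."""
    mean = sum(row) / len(row)
    return [value > mean for value in row]

# Hypothetical: the same raw count in a small and a large sample.
raw = {"family_A": [10, 10]}
sizes = [100, 1000]
normalized = normalize_by_sample_size(raw, sizes)
flags = above_average(normalized["family_A"])
```

With raw counts the two samples look identical; after dividing by sample size the smaller sample stands out.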
  • 11. Example of phage Pfams clustering together
  • 12. Measuring functional relatedness
    Need to measure community profiling performance
    The hierarchical clusters were broken into 575 groups using a correlation cutoff of 0.90 or above.
    Pfams were mapped to GO terms using pfam2GO
    1893 Pfams had no associated GO term
    695 of these were Domains of Unknown Function (DUFs)
    3377 Pfams had one or more associated GO terms and could be used for further analysis
    Only 67 (of 575) clusters contained 4 or more Pfams with at least one GO term
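The mapping and filtering steps on this slide might look roughly like the sketch below. It assumes the mapping file uses lines of the form `Pfam:<accession> <name> > GO:<term name> ; GO:<id>` with `!` comment lines; the accessions, GO IDs, and cluster names are placeholders, not real data.

```python
from collections import defaultdict

def parse_pfam2go(lines):
    """Build {pfam accession: set of GO IDs} from mapping lines."""
    mapping = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank and comment lines
        left, sep, right = line.partition(" > ")
        if not sep:
            continue  # not a mapping line
        pfam_id = left.split()[0].replace("Pfam:", "")
        go_id = right.rsplit(";", 1)[-1].strip()
        mapping[pfam_id].add(go_id)
    return mapping

def usable_clusters(clusters, mapping, min_annotated=4):
    """Keep clusters with at least min_annotated Pfams that have a GO term."""
    return {
        name: members
        for name, members in clusters.items()
        if sum(1 for m in members if m in mapping) >= min_annotated
    }

# Placeholder mapping lines and clusters, for illustration only.
example_lines = [
    "!placeholder header comment",
    "Pfam:PF90001 fam_one > GO:example activity ; GO:9000001",
    "Pfam:PF90001 fam_one > GO:example process ; GO:9000002",
    "Pfam:PF90002 fam_two > GO:example activity ; GO:9000001",
]
pfam_to_go = parse_pfam2go(example_lines)
clusters = {
    "cluster_1": ["PF90001", "PF90002", "PF99999"],
    "cluster_2": ["PF99999"],
}
kept = usable_clusters(clusters, pfam_to_go, min_annotated=2)
```

The default of 4 matches the threshold on the slide; the example lowers it to 2 only because the toy data is tiny.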
  • 13. Measuring GO similarity
    G-SESAME measures the semantic similarity of any two GO terms
    Not downloadable, so queries had to be made to their web server (not fun)
    Pair-wise similarity was measured for each pair of GO terms in each cluster
    Had to check that both terms were in the same namespace
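The per-cluster scoring described above could be sketched as follows. Because the real scores came from G-SESAME web queries, the `similarity` argument here is a hypothetical stand-in backed by a lookup table, and the GO IDs and namespaces are placeholders.

```python
from itertools import combinations

def cluster_score(terms, namespace, similarity):
    """Average pairwise similarity over term pairs in the same namespace."""
    scores = [
        similarity(a, b)
        for a, b in combinations(terms, 2)
        if namespace[a] == namespace[b]  # only compare within a namespace
    ]
    return sum(scores) / len(scores) if scores else None

# Placeholder terms; the table stands in for web-server responses.
namespace = {
    "GO:9000001": "molecular_function",
    "GO:9000002": "molecular_function",
    "GO:9000003": "biological_process",
}
table = {frozenset({"GO:9000001", "GO:9000002"}): 0.8}

def lookup_similarity(a, b):
    return table[frozenset({a, b})]

score = cluster_score(list(namespace), namespace, lookup_similarity)
```

Pairs that cross namespaces are skipped entirely, so the stand-in table only needs entries for same-namespace pairs.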
  • 14. Results
    Average G-Sesame scores for each cluster
    The average of all cluster averages was 0.484
    10 clusters had a score of 0.60 or greater.
    The data were then randomized by assigning the same GO terms to different random clusters, which gave average scores of 0.412-0.420 over 4 iterations
    Each of the 4 iterations had only 1 or 0 clusters with a score of 0.60 or greater
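One way to build such a randomized control is to pool all GO terms and deal them back into clusters of the original sizes. This is a sketch under that assumption; the slides do not spell out the exact shuffling procedure, and the term labels are placeholders.

```python
import random

def shuffle_clusters(clusters, rng):
    """Reassign the same terms to clusters of the same sizes, at random."""
    pool = [term for members in clusters for term in members]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for members in clusters:
        shuffled.append(pool[start:start + len(members)])
        start += len(members)
    return shuffled

# Placeholder clusters of GO terms.
original = [["GO:a", "GO:b", "GO:c"], ["GO:d", "GO:e"]]
randomized = shuffle_clusters(original, random.Random(0))
```

Scoring `randomized` with the same per-cluster average gives a baseline to compare against the real clusters; repeating with different seeds corresponds to the multiple iterations on the slide.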
  • 15. Community Profiling Results
    Average of all clusters = 0.49
    10 clusters are > 0.60
  • 16. Random Results
    Average of all clusters (4 iterations) = 0.41 - 0.42
    1 or 0 clusters are > 0.60
  • 17. BioTorrents
  • 18. BitTorrent
    A peer-to-peer file sharing protocol
    ~ 27-55% of all Internet traffic
    Mostly illegal file sharing
    Files are shared in small pieces between several
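To make the "small pieces" idea concrete, here is a sketch of how a file's bytes can be cut into fixed-size pieces and each piece hashed with SHA-1, which is how the original BitTorrent protocol lets a downloader verify pieces independently. The piece length and data are illustrative, and this does not build a real .torrent file.

```python
import hashlib

def piece_hashes(data: bytes, piece_length: int):
    """Split data into fixed-size pieces and return each piece's SHA-1 hex digest."""
    return [
        hashlib.sha1(data[i:i + piece_length]).hexdigest()
        for i in range(0, len(data), piece_length)
    ]

# Illustrative payload: 10 bytes cut into pieces of 4 bytes (last piece is shorter).
payload = b"a" * 10
hashes = piece_hashes(payload, piece_length=4)
```

Because every piece has its own hash, pieces can be fetched from different peers in any order and checked one at a time.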
  • 19. Torrents for Biology
    Why use torrent technology?
    Download large datasets much faster
    Searchable central listing
    Decentralization of data
  • 20. What is BioTorrents?
    A legal file sharing website for scientists
    Users can upload their own research results, data, software
    Users can browse or search through all datasets
    Data is not hosted on BioTorrents
  • 21.
  • 22. Browse & Search
  • 23. Details
  • 24. Sign Up
  • 25. Upload
  • 26. Other Features
    RSS Feed
    Top 10
  • 27. Who will upload data?
    Large organizations (e.g. NCBI, CAMERA, etc.)
    May need some convincing to host their data via torrents in addition to FTP, HTTP, etc.
    Scientists that really support open science
    Sharing data before it is formally complete and published
  • 28. Technical Challenges
    Many institutions frown on BitTorrent technology
    A port must be opened/forwarded
    Client program and computer must be left running
    Ensuring data is legal, virus free, etc.
    Users that upload many legitimate torrents will provide more confidence to people downloading
    Making downloading and uploading easy