BioTorrents: A File Sharing Service for Scientific Data


This talk was presented at the Open Science Summit 2010 conference in Berkeley, CA.

  • 2) Mention strict structure of existing dbs
  • Bandwidth limited because of intermediate links (geospatial)
  • 25-50% of all Internet traffic is BitTorrent traffic
  • Link to paper
  • Existing data providers (NCBI, EBI, JGI, etc.)
  • Scientists sharing manuscript supplementary data
      • All data is bundled together and given a unique id
      • Easier than setting up a Web/FTP server
  • Scientists that want to provide immediate access to their data and results
      • Pre/post publication
      • Data that might not be suitable for existing databases
      • Results that may not be sufficient for publication
  • BioTorrents: A File Sharing Service for Scientific Data

    1. Morgan Langille, PhD. Open Science Summit 2010, Berkeley, California, July 29th, 2010
    2. Acknowledgements
        • iSEEM project
        • Dr. Jonathan Eisen
        • UC Davis
        • Questions/Comments
        • Twitter: @BetaScience
    3. Motivation
        • Data in science is growing rapidly
        • Transfer times increasing
        • Reliability of data transfer
        • Sharing scientific data openly
    4. Personal Challenges
        • Improve download speed and reliability from large data providers
        • Encourage sharing of all data associated with a study
        • Allow easier sharing of unpublished data
    5. Traditional file transfer methods
        • Single source server
        • Bandwidth limitations
        • No data redundancy
        • No data verification
    6. Peer-to-peer file transfer: BitTorrent
        • Data is shared between all computers
        • Bandwidth grows as the number of users increases
        • Data redundancy
        • Data is verified
            • SHA-1 cryptographic hash
        • 25-50% of all Internet traffic is BitTorrent
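
The "data is verified" point can be illustrated with a short sketch. BitTorrent splits data into fixed-size pieces and records a SHA-1 digest for each; a client recomputes the digest of every piece it receives and discards any that do not match. The function names and the tiny 4-byte piece size are assumptions chosen for illustration, not taken from the talk or from any particular client.

```python
import hashlib

def piece_hashes(data: bytes, piece_length: int) -> list:
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]

def verify_piece(piece: bytes, expected_digest: bytes) -> bool:
    """Return True if a received piece matches its published SHA-1 digest."""
    return hashlib.sha1(piece).digest() == expected_digest

# Illustrative data only: a short sequence split into 4-byte pieces.
data = b"ACGTACGTTTGACA"
hashes = piece_hashes(data, 4)

good = verify_piece(data[0:4], hashes[0])   # intact piece
bad = verify_piece(b"ACGA", hashes[0])      # corrupted piece
```

Real torrents use much larger pieces (commonly hundreds of kilobytes or more), but the check is the same.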
    7. BitTorrent: How it works
        • User installs BitTorrent client software
        • User downloads a small “.torrent” descriptor file
        • Client software connects to “Tracker” to obtain a list of other “peers” with same data
        • Client begins downloading/uploading
        (Diagram: “.torrent” file and “Tracker” server)
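
As a rough sketch of the "connects to Tracker" step: an HTTP tracker is contacted with a GET request whose query string carries the torrent's 20-byte info hash and some client state, and the tracker replies with a list of peers. The parameter names below follow the standard BitTorrent tracker protocol; the tracker URL, peer id, and function name are hypothetical placeholders.

```python
from urllib.parse import urlencode

def announce_url(tracker: str, info_hash: bytes, peer_id: bytes,
                 port: int, left: int) -> str:
    """Build the GET URL a client would request to announce itself to a tracker."""
    params = {
        "info_hash": info_hash,   # 20-byte SHA-1 identifying the torrent
        "peer_id": peer_id,       # 20-byte id chosen by the client
        "port": port,             # port the client listens on
        "uploaded": 0,
        "downloaded": 0,
        "left": left,             # bytes still to download
        "event": "started",
    }
    # urlencode percent-encodes the raw bytes values
    return tracker + "?" + urlencode(params)

# Hypothetical tracker and ids, for illustration only.
url = announce_url(
    "http://tracker.example.org/announce",
    info_hash=bytes(20),
    peer_id=b"-XX0001-123456789012",
    port=6881,
    left=1000,
)
```

The sketch only builds the request; sending it and decoding the tracker's response are left out.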
    8. Other BitTorrent Advantages
        • Every dataset is given a unique id (SHA-1 hash)
        • Distributed Hash Table (DHT) & Peer Exchange (PEX)
            • Tracker-less peer identification
        • Local Peer Discovery (LPD)
            • Finds peers on local area network (LAN) allowing much faster data transfer
        • Web Seeds
            • FTP or HTTP resources can be added to the torrent
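
To make the "unique id" concrete: the id of a torrent is the SHA-1 digest of the bencoded "info" dictionary stored in the .torrent file, so the same dataset description always yields the same id. The minimal bencoder below is an illustrative sketch, and the file name and sizes are invented for the example.

```python
import hashlib

def bencode(value) -> bytes:
    """Encode ints, byte/text strings, lists, and dicts in bencode format."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))

def info_hash(info: dict) -> str:
    """Return the hex SHA-1 of the bencoded info dictionary."""
    return hashlib.sha1(bencode(info)).hexdigest()

# Hypothetical single-file dataset with one 4-byte piece.
data = b"ACGT"
info = {
    "name": "example.fasta",
    "length": len(data),
    "piece length": 4,
    "pieces": hashlib.sha1(data).digest(),
}
dataset_id = info_hash(info)
```

Because the piece digests are part of the info dictionary, any change to the data changes the id.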
    9. BitTorrent Trackers
        • Many trackers already exist
        • Almost all have legal issues with copyright infringement
        • None are tailored to hosting scientific datasets
    10.
        • BioTorrents is a file sharing website for scientists
        • BioTorrents provides a central listing of datasets
        • Anyone can upload their own data
        • All data must be “open”; no illegal file sharing
        • Data is not hosted on BioTorrents**
        Langille & Eisen, 2010, PLoS ONE 5: e10071.
    11. BioTorrents: Advanced Features
        • Browse and search by
            • Keyword (dataset title and description)
            • Category (Genomics, Proteomics, Chemistry, etc.)
            • License (Public Domain, Creative Commons, GPL, etc.)
            • Username (mlangill, jeisen, NCBI, etc.)
        • RSS feeds and automatic downloading
        • Torrents linked into “Versions”
        • Upload script for bulk torrent creation
    12. BioTorrents progress
        • 1000 registered users
        • 43 datasets (107 GB)
        • 766 downloads
        • 1386 GB data transferred
    13. Real Example
        • Download GenBank (~230GB) from NCBI

        NCBI to UC Davis        Download speed    Time
        Max                     30MB/s            2 hours
        FTP to other server     ~10MB/s           6 hours
        FTP to NCBI             ~0.5MB/s          5 days
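
The times on this slide follow from dividing the dataset size by the download speed. The sketch below reproduces that arithmetic, assuming 1 GB = 1000 MB and a constant transfer rate; the function name is illustrative.

```python
def transfer_hours(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_gb at a constant rate (1 GB taken as 1000 MB)."""
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 3600

genbank_gb = 230  # approximate GenBank size quoted on the slide

max_rate_hours = transfer_hours(genbank_gb, 30)         # about 2 hours
other_ftp_hours = transfer_hours(genbank_gb, 10)        # about 6 hours
ncbi_ftp_days = transfer_hours(genbank_gb, 0.5) / 24    # about 5 days
```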
    14. Who will use BioTorrents?
        • Existing large data providers
            • More reliable and faster downloads for users
            • Lower bandwidth requirements for provider
        • Scientists sharing published data
            • All data is bundled together and given a unique id
            • Easier than setting up a Web/FTP server
        • Scientists sharing unpublished data
            • Data that might not be suitable for existing databases
            • Results that may not be sufficient for publication
    15. Issues
        • BitTorrent works best for large, popular datasets
        • Long term seeding
            • At least one seeder must exist
        • Many institutions block/limit BitTorrent activity
    16. Future
        • Metalink
            • XML Link Protocol
            • Combines multiple sources
                • FTP, HTTP, BitTorrent, etc.
        • Volunteer Storage
            • Parallel to volunteer computing
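
As a rough illustration of how Metalink combines multiple sources, the sketch below builds a small XML document listing FTP, HTTP, and BitTorrent locations for one file. The element names and namespace follow my understanding of the Metalink 4 format (RFC 5854) and should be treated as an assumption; the file name, size, and URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

METALINK_NS = "urn:ietf:params:xml:ns:metalink"  # assumed Metalink 4 namespace

def build_metalink(name: str, size: int, urls: list, torrent_url: str) -> str:
    """Return a Metalink-style XML string listing several sources for one file."""
    root = ET.Element("metalink", xmlns=METALINK_NS)
    file_el = ET.SubElement(root, "file", name=name)
    ET.SubElement(file_el, "size").text = str(size)
    # Direct FTP/HTTP mirrors
    for url in urls:
        ET.SubElement(file_el, "url").text = url
    # The same file offered over BitTorrent
    ET.SubElement(file_el, "metaurl", mediatype="torrent").text = torrent_url
    return ET.tostring(root, encoding="unicode")

# Hypothetical sources, for illustration only.
xml_text = build_metalink(
    "example.fasta",
    1024,
    ["ftp://ftp.example.org/example.fasta",
     "http://www.example.org/example.fasta"],
    "http://www.example.org/example.fasta.torrent",
)
```

A download client that understands such a file can pull from whichever sources are reachable, which is one way around institutions that block BitTorrent.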
    17. Final Message
        • Data transfer should be fast and easy
        • Scientific community should embrace existing technologies such as BitTorrent
        • BioTorrents uses the strengths of BitTorrent and provides features unique to scientific data