Building A Scalable Open Source Storage Solution

Building A Scalable Open Source Storage Solution - Presentation Transcript

  1. Building a scalable, open source storage and processing solution for biodiversity data Phil Cryer Anthony Goddard Thursday, November 12, 2009
  2. > Biodiversity Heritage Library's data • all BHL storage is handled by the Internet Archive • 38,000+ scanned books • approximately 48 terabytes of data • unable to self-host
  3. > BHL - Europe • 3-year, EU-funded project • 28 major natural history museums, botanical gardens and other cooperating institutions • third file-store of all BHL data • collecting cultural heritage from all over Europe
  4. > Data explosion • more data being created • more data being saved • more data tomorrow • storage has not kept up with Moore’s Law • this presentation will be saved online, more data!
  6. > Potential #fails
  7. > Problem 1 - Data access • file size we can’t store • latency of large files • quality user experience • processing data-mining
  8. > Problem 1 - Data access • file size we can’t store • latency of large files • quality user experience • processing data-mining [overlay: Access denied...]
  9. > Problem 2 - Copyright concerns © • international copyright concerns • potential related funding issues • we’d rather not let this be an issue
  10. > Problem 3 - Redundancy
  11. > Problem 3 - Redundancy • computers crash • hard drives die • networks fail • natural disasters occur
  12. > Problem 3 - Redundancy • computers crash • hard drives die • networks fail • natural disasters occur but... This is NOT a problem!
  13. ...so plan for it.
  15. Current
  18. > Site 1 - Internet Archive
  19. > Site 2 - MBL, Woods Hole
  20. > Site 3 - NHM, London ...followed by new data center
  21. Data Centre – “Darwin Repository” • €600,000 funding secured from eContentPlus • Suitable location found with very good development potential in collaboration with Science Museum. • Economy of scale provides additional avenues for co-development of services. – Disaster Recovery and Business Continuity for all Museums (help with ongoing and running costs) • DCMS funding sought to help with development. – e-Infrastructure European initiative • Building Digital Repositories for Scientific Communities – PESI (Biodiversity)
  22. Proposed Data Centre Location [map: Swindon, Wroughton, Science Museum; ©2008 Google, imagery ©2008 DigitalGlobe, Infoterra Ltd & Bluesky, GeoEye; map data ©2008 Tele Atlas]
  23. Vendor Stakeholders / Partners • Identified Technology Partners* • Additional Funding Partners* *Note: Discussions are ongoing with all Partners and may be at different stages
  24. Long Term Sustainability • No Dripping Tap – Business case should provide for significant self funding opportunities. • Diversity – Darwin Repository (Data Centre) will provide an economy of scale that will provide significant efficiency gains. • Green technology to minimise carbon footprint and provide industry leadership.
  25. > Distributed storage • write once, read anywhere • replication and fault tolerance • error correction • automatic redundancy • scalable horizontally
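The replication and error-correction bullets on slide 25 can be illustrated with a small sketch. This is not how GlusterFS or any other system in the deck implements them: local directories stand in for storage nodes, SHA-256 checksums stand in for error detection, and the function names (`write_replicated`, `repair`, `demo`) are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256(data: bytes) -> str:
    """Return the hex SHA-256 digest used here to detect corruption."""
    return hashlib.sha256(data).hexdigest()


def write_replicated(nodes, name, data):
    """Write the same bytes to every node directory and return the checksum."""
    for node in nodes:
        (node / name).write_bytes(data)
    return sha256(data)


def repair(nodes, name, expected):
    """Overwrite missing or corrupt replicas from a copy that still verifies.

    Returns the number of replicas that were rewritten.
    """
    good = None
    for node in nodes:
        path = node / name
        if path.exists() and sha256(path.read_bytes()) == expected:
            good = path.read_bytes()
            break
    if good is None:
        raise RuntimeError("no intact replica left")

    repaired = 0
    for node in nodes:
        path = node / name
        if not path.exists() or sha256(path.read_bytes()) != expected:
            path.write_bytes(good)
            repaired += 1
    return repaired


def demo():
    """Create three stand-in nodes, corrupt one replica, then repair it."""
    with tempfile.TemporaryDirectory() as root:
        nodes = [Path(root) / f"node{i}" for i in range(3)]
        for node in nodes:
            node.mkdir()
        digest = write_replicated(nodes, "page.txt", b"scanned page")
        (nodes[1] / "page.txt").write_bytes(b"bit rot")  # simulated corruption
        fixed = repair(nodes, "page.txt", digest)
        intact = all(
            sha256((node / "page.txt").read_bytes()) == digest for node in nodes
        )
        return fixed, intact
```

Calling `demo()` corrupts one of three replicas and restores it from a verified copy. A real cluster file system would do this across hosts and without a caller having to trigger it.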
  26. > Distributed storage - Options • fully hosted storage (cloud) • hosted with own storage (private cloud) • self hosted with proprietary hardware (Sun Thumper) • self hosted with commodity hardware
  27. > Distributed storage - GlusterFS • GlusterFS: a cluster file-system capable of scaling to several petabytes • open source software on commodity hardware • tunable performance • simple to install and manage • offers seamless expansion
  28. > Distributed storage - Archival • Fedora Commons is an open source repository • accounts for all changes, so built-in version control • provides disaster recovery • open standards to mesh with future file formats • provides open sharing services such as OAI-PMH
  29. > Distributed storage - Mirrored data • now we have redundancy • in fact, multiple redundant copies • provides fault tolerance • offers load balancing • gives us future geographical distribution
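Slide 29's claim that mirrored copies provide fault tolerance can be sketched as a read that falls back from one site to the next. The site names come from slides 18 to 20; the `Mirror` class, the `SiteDown` exception, and the file path are hypothetical stand-ins, and in-memory dictionaries replace real network storage.

```python
class SiteDown(Exception):
    """Raised when a mirror cannot serve a request."""


class Mirror:
    """An in-memory stand-in for one site holding a full copy of the data."""

    def __init__(self, name, files, online=True):
        self.name = name
        self.files = files
        self.online = online

    def read(self, path):
        if not self.online:
            raise SiteDown(self.name)
        return self.files[path]


def read_with_failover(mirrors, path):
    """Try mirrors in preference order and return (site name, bytes)."""
    for mirror in mirrors:
        try:
            return mirror.name, mirror.read(path)
        except SiteDown:
            continue  # this site is unavailable, try the next copy
    raise SiteDown("all mirrors unavailable")


# Hypothetical content; every site holds the same copy.
book = {"book/0001.jp2": b"page image"}
mirrors = [
    Mirror("Internet Archive", book, online=False),  # simulated outage
    Mirror("MBL, Woods Hole", book),
    Mirror("NHM, London", book),
]
site, data = read_with_failover(mirrors, "book/0001.jp2")
```

With the first site marked offline, the read is answered by the second. Ordering the list by network distance would be one way to get the geographical benefit the slide mentions.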
  30. > Now we have lots of computers...
  31. > Distributed processing • more abilities available than just storing data • with distributed storage comes distributed processing • distributed processing means faster answers • faster answers mean new questions • lather, rinse, repeat
  32. > Distributed processing • make your data more useful • image and OCR processing • distributed web services • identifier resolution pools • map/reduce frameworks • generate new visualizations, text mining, NLP
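The map/reduce bullet on slide 32 can be made concrete with a single-process sketch: each page of OCR text is mapped to per-name counts, and the counts are reduced into one total. The name list and function names are hypothetical, and a framework such as Hadoop or Disco would run the map step on many nodes rather than in one loop.

```python
from collections import Counter
from functools import reduce

# Hypothetical lookup list; a real pipeline would use a name-finding service.
KNOWN_NAMES = ("Quercus alba", "Acer rubrum")


def map_page(text):
    """Map step: count mentions of each known name on one page."""
    return Counter({name: text.count(name) for name in KNOWN_NAMES if name in text})


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def count_names(pages):
    """Run the map step over every page, then fold the partial results."""
    return reduce(reduce_counts, map(map_page, pages), Counter())


pages = [
    "Quercus alba and Acer rubrum grow here.",
    "Quercus alba again; Quercus alba dominates.",
]
totals = count_names(pages)
```

Because the reduce step is associative, partial counts from different nodes can be merged in any order, which is what lets the work spread across a cluster.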
  33. > Distributed processing [diagram labels: Request, Site, Load Balancer, Cluster, Cluster Node, WebService, TaxonFinder]
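Slide 33 shows requests passing through a load balancer to TaxonFinder web services on cluster nodes. A minimal round-robin sketch of that idea follows; the handlers are hypothetical stand-ins and do not model TaxonFinder's actual interface.

```python
import itertools


class RoundRobinBalancer:
    """Hand each request to the next backend in turn, skipping failed ones."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._cycle = itertools.cycle(self._backends)

    def dispatch(self, request):
        # Try each backend at most once per request.
        for _ in range(len(self._backends)):
            backend = next(self._cycle)
            try:
                return backend(request)
            except ConnectionError:
                continue
        raise ConnectionError("no backend available")


def make_node(name, up=True):
    """Build a stand-in cluster node; a real node would call a web service."""

    def handle(request):
        if not up:
            raise ConnectionError(f"{name} is down")
        return f"{name}: {request}"

    return handle


balancer = RoundRobinBalancer(
    [make_node("node-a"), make_node("node-b", up=False), make_node("node-c")]
)
results = [balancer.dispatch("text") for _ in range(3)]
```

With `node-b` down, the three requests land on `node-a`, `node-c`, and `node-a` again. Round-robin is only one possible policy; the slides do not say which one the cluster uses.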
  34. > Some assembly required (optional) • our example uses new, faster commodity hardware • but it could run on any hardware that can run Linux • you could chain old "outdated" computers together • build your own cluster for next to nothing (host it in your basement) • solves some infrastructure funding issues • hardware vendor neutrality
  35. > Our proof of concept • we ran a six box cluster to demonstrate GlusterFS • ran stock Debian GNU/Linux • simulated hardware failures • synced data with a remote cluster • ran map/reduce jobs • defined procedures, configurations and build scripts
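The "synced data with a remote cluster" step on slide 35 can be sketched as comparing two file manifests and copying only what differs. This is an illustration under assumptions, not the project's procedure and not a description of how rsync decides what to transfer: dictionaries stand in for the two clusters, and `manifest`, `plan_sync`, and `apply_sync` are hypothetical names.

```python
import hashlib


def manifest(files):
    """Map each path to the SHA-256 digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def plan_sync(source, target):
    """Return (paths to copy, paths to delete) that make target match source."""
    src, dst = manifest(source), manifest(target)
    to_copy = sorted(path for path, digest in src.items() if dst.get(path) != digest)
    to_delete = sorted(path for path in dst if path not in src)
    return to_copy, to_delete


def apply_sync(source, target):
    """Apply the plan in place and return it for logging."""
    to_copy, to_delete = plan_sync(source, target)
    for path in to_copy:
        target[path] = source[path]
    for path in to_delete:
        del target[path]
    return to_copy, to_delete


# Hypothetical contents of a local and a remote cluster.
local = {"a": b"1", "b": b"2", "c": b"3"}
remote = {"a": b"1", "b": b"old", "d": b"stale"}
copied, deleted = apply_sync(local, remote)
```

Here only `b` and `c` are transferred and `d` is removed, after which a second `plan_sync` call returns two empty lists.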
  38. [architecture diagram labels: www, presentation, REST API, mod_glusterfs, disco, processing, Hadoop, Fedora Commons, metadata, Mulgara triplestore / RDF, rsync, GlusterFS, sync, support, file system, BitTorrent, Ext4 (exabyte), HTTP, ‘Network RAID’, raw disk array, commodity SATA controllers, storage, commodity hosts, dedicated storage network]
  39. > Distributed storage - Projected costs: $246,000 [graph from Backblaze (http://www.backblaze.com)]
  40. > Other avenues - Cloud pilot • BHL is participating in a pilot with New York Public Library and Duraspace • Duraspace would provide a link to cloud providers • pilot to show feasibility of hosting • testing use of image server, other services in the cloud • cloud could seed new clusters
  41. > Code (63 6f 64 65) • all of our code and configurations are open source • hosted on Google Code • get involved • join the mailing lists • follow us on Twitter • ask questions, we'll help!
  42. > It’s your turn... • similar projects? • distributed services and processing? • where can this be best applied? • resilient services on top of storage • names processing? • LSID resolution pools? • image processing? • text-mining / NLP? • #biodiv webservices?
  43. Web: http://www.biodiversitylibrary.org/ • Code, Support: http://code.google.com/p/bhl-bits • Twitter: @BioDivLibrary (tag #bhl) • Phil Cryer, Missouri Botanical Garden, Biodiversity Heritage Library, phil.cryer@mobot.org, http://philcryer.com, @fak3r • Anthony Goddard, MBLWHOI Library, Biodiversity Heritage Library, agoddard@mbl.edu, http://anthonygoddard.com, @anthonygoddard

Usage Rights

CC Attribution-NonCommercial-ShareAlike License
