Building a scalable, open source
              storage and processing solution for
                        biodiversity data




                   Phil Cryer
                   Anthony Goddard


Thursday, November 12, 2009
> Biodiversity Heritage Library's data 


   • all BHL storage is handled by the
     Internet Archive

   • 38,000+ scanned books

   • approximately 48
     terabytes of data

   • unable to self-host




> BHL - Europe


   • 3 year, EU funded project

   • 28 major natural history museums,
     botanical gardens and other
     cooperating institutions

   • third file-store of all BHL data

   • collecting cultural heritage from all
     over Europe




> Data explosion


   • more data being created

   • more data being saved

   • more data tomorrow

   • storage has not kept up with
     Moore’s Law

   • this presentation will be saved
     online, more data!




> Potential #fails




> Problem 1 - Data access


   • file size we can’t store


   • latency of large files

   • quality user experience

   • processing data-mining

                                Access denied...


> Problem 2 - Copyright concerns




                                        ©
   • international copyright concerns

   • potential related funding issues

   • we’d rather not let this be an
     issue




> Problem 3 - Redundancy


   • computers crash

   • hard drives die

   • networks fail

   • natural disasters occur


                              but...

              This is NOT a problem!




...so plan for it.



Current




> Site 1 - Internet Archive




> Site 2 - MBL, Woods Hole




> Site 3 - NHM, London




                              
   ...followed by a new data centre
Data Centre – “Darwin Repository”

      • €600,000 Funding secured from eContentPlus
      • Suitable location found with very good development
        potential in collaboration with Science Museum.
      • Economy of scale provides additional avenues for co-
        development of services.
         – Disaster Recovery and Business Continuity for all
           Museums (help with ongoing and running costs)
            • DCMS funding sought to help with development.
         – e-Infrastructure European initiative
            • Building Digital Repositories for Scientific
              Communities
               – PESI (Biodiversity)



Proposed Data Centre Location


                              Swindon




          Wroughton Science Museum
                                        ©2008 Google – Imagery ©2008 DigitalGlobe, Infoterra Ltd & Bluesky, GeoEye, Map data ©2008 Tele Atlas




Vendor Stakeholders / Partners

      • Identified Technology Partners*




      • Additional Funding Partners*




                              *Note: Discussions are ongoing with all Partners and may be at different stages




Long Term Sustainability

      • No Dripping Tap
            – Business case should provide for significant
              self funding opportunities.
      • Diversity
            – Darwin Repository (Data Centre) will provide
              an economy of scale that will provide
              significant efficiency gains.
      • Green technology to minimise carbon footprint
        and provide industry leadership.



> Distributed storage


   • write once, read anywhere

   • replication and fault tolerance

   • error correction

   • automatic redundancy

   • scalable horizontally
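
A minimal sketch of the replication and error-checking ideas on this slide, in Python. It is an illustration only and not how GlusterFS or any other product implements them: the "nodes" are plain local directories, and the names write_replicated, read_verified and demo are hypothetical. Each file is written to every node, and a read returns the first copy whose SHA-256 checksum still matches, so a lost or corrupted copy is skipped.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def write_replicated(name: str, data: bytes, nodes: list[Path]) -> str:
    """Write the same bytes to every node directory and return their checksum."""
    for node in nodes:
        (node / name).write_bytes(data)
    return sha256(data)


def read_verified(name: str, checksum: str, nodes: list[Path]) -> bytes:
    """Return the first copy whose checksum matches, skipping missing or bad copies."""
    for node in nodes:
        path = node / name
        if not path.exists():
            continue  # this node lost its copy (simulated drive failure)
        data = path.read_bytes()
        if sha256(data) == checksum:
            return data
    raise IOError(f"no intact copy of {name}")


def demo() -> bytes:
    """Three hypothetical nodes: one loses the file, one corrupts it, one is intact."""
    with tempfile.TemporaryDirectory() as root:
        nodes = [Path(root) / f"node{i}" for i in range(3)]
        for node in nodes:
            node.mkdir()
        checksum = write_replicated("page.txt", b"scanned page", nodes)
        (nodes[0] / "page.txt").unlink()            # simulate a dead drive
        (nodes[1] / "page.txt").write_bytes(b"xx")  # simulate silent corruption
        return read_verified("page.txt", checksum, nodes)
```

Calling demo() still returns the original bytes, because the third copy survives. A real system would also repair the damaged copies from the good one.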




> Distributed storage - Options


  • fully hosted storage (cloud)

  • hosted with own storage (private cloud)

  • self hosted with proprietary hardware (Sun
    Thumper)

  • self hosted with commodity hardware




> Distributed storage - GlusterFS


   • GlusterFS: a cluster file-system
     capable of scaling to several
     petabytes

   • open source software on
     commodity hardware

   • tunable performance

   • simple to install and manage

   • offers seamless expansion




> Distributed storage - Archival 


   • Fedora Commons is an open
     source repository

   • accounts for all changes, so built-
     in version control

   • provides disaster recovery

   • open standards to mesh with
     future file formats

   • provides open sharing services
     such as OAI-PMH




> Distributed storage - Mirrored data


   • now we have redundancy

   • in fact, multiple redundant
     copies

   • provides fault tolerance

   • offers load balancing

   • gives us future geographical
     distribution




> Now we have lots of computers...




> Distributed processing


   • more abilities available than just
     storing data

   • with distributed storage comes
     distributed processing

   • distributed processing means
     faster answers

   • faster answers mean new
     questions

   • lather, rinse, repeat




> Distributed processing


   • make your data more useful

   • image and OCR processing

   • distributed web services

   • identifier resolution pools

   • map/reduce frameworks

   • generate new visualizations, text
     mining, NLP
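
To make the map/reduce bullet concrete, here is a small Python sketch that counts words across pages of OCR text. It is not the Hadoop or disco API: a thread pool stands in for cluster nodes, and the names map_page, reduce_counts and word_count are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_page(text: str) -> Counter:
    """Map step: count the words on one page of OCR text."""
    return Counter(text.lower().split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(pages: list[str], workers: int = 4) -> Counter:
    """Run the map step in parallel, then fold the partial results together."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        partials = list(executor.map(map_page, pages))
    return reduce(reduce_counts, partials, Counter())
```

Because each page is mapped independently, the same two functions could be handed to a real framework and spread over many machines; only word_count, the local driver, would change.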




> Distributed processing

   [Diagram, bottom to top: Request → Site → Cluster → Cluster Node →
    Load Balancer → WebService (×2) → TaxonFinder (multiple instances)]
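
The load balancer in front of a pool of web services can be sketched in a few lines of Python. This is an assumption-laden illustration, not the deployed setup: backends are plain functions standing in for web service instances, and RoundRobinBalancer and make_node are hypothetical names. Requests rotate through the pool, and a node that raises ConnectionError is skipped.

```python
from itertools import cycle
from typing import Callable, Optional

Backend = Callable[[str], str]


class RoundRobinBalancer:
    """Rotate requests across backends, skipping any that are down."""

    def __init__(self, backends: list[Backend]) -> None:
        if not backends:
            raise ValueError("need at least one backend")
        self._count = len(backends)
        self._ring = cycle(backends)

    def handle(self, request: str) -> str:
        """Try each backend at most once, starting from the next in rotation."""
        last_error: Optional[Exception] = None
        for _ in range(self._count):
            backend = next(self._ring)
            try:
                return backend(request)
            except ConnectionError as err:
                last_error = err  # node is down; move on to the next one
        raise ConnectionError("all backends failed") from last_error


def make_node(name: str, up: bool = True) -> Backend:
    """Build a hypothetical backend that answers, or fails if marked down."""
    def node(request: str) -> str:
        if not up:
            raise ConnectionError(f"{name} is down")
        return f"{name} handled {request}"
    return node
```

With three nodes of which the second is down, successive requests land on the first, third and first node again, which is the fault tolerance the cluster layout is meant to provide.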




> Some assembly required (optional)


   • our example uses new, faster
     commodity hardware

   • but it could run on any hardware
     that can run Linux

   • you could chain old "outdated"
     computers together

   • build your own cluster for next to
     nothing (host it in your basement)

   • solves some infrastructure funding
     issues

   • hardware vendor neutrality




> Our proof of concept

   • we ran a six box cluster to
     demonstrate GlusterFS

   • ran stock Debian GNU/Linux

   • simulated hardware failures

   • synced data with a remote cluster

   • ran map/reduce jobs

   • defined procedures, configurations
     and build scripts




   [Diagram: the proposed stack, top to bottom]

   • www

   • REST (API) / mod_glusterfs (presentation)

   • disco, Hadoop (processing)

   • Fedora Commons, Mulgara triplestore / RDF (metadata)

   • rsync, BitTorrent, HTTP (sync support) /
     GlusterFS, Ext4 (exabyte), ‘Network RAID’ (file system)

   • raw disk array, commodity SATA controllers,
     commodity hosts, dedicated storage network (storage)

> Distributed storage - Projected costs




                                   $246,000




                              Graph from Backblaze (http://www.backblaze.com)


> Other avenues - Cloud pilot

   • BHL is participating in a pilot with New
     York Public Library and Duraspace

   • Duraspace would provide a link to
     cloud providers

   • pilot to show feasibility of hosting

   • testing use of image server, other
     services in the cloud

   • cloud could seed new clusters




> Code (63 6f 64 65)

   • all of our code and configurations are
     open source

   • hosted on Google Code

   • get involved

   • join the mailing-lists

   • follow us on Twitter

   • ask questions, we'll help!




> It’s your turn...


   • similar projects?

   • distributed services and processing?

   • where can this be best applied?

   • resilient services on top of storage

              • names processing?

              • LSID resolution pools?

              • image processing?

              • text-mining / NLP?

              • #biodiv webservices?



Web: http://www.biodiversitylibrary.org/
         Code, Support: http://code.google.com/p/bhl-bits
         Twitter: @BioDivLibrary (tag #bhl)




  Phil Cryer                          Anthony Goddard

  Missouri Botanical Garden           MBLWHOI Library
  Biodiversity Heritage Library       Biodiversity Heritage Library

      phil.cryer@mobot.org               agoddard@mbl.edu
      http://philcryer.com               http://anthonygoddard.com
      @fak3r                             @anthonygoddard



What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Pitangent Analytics & Technology Solutions Pvt. Ltd
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
DianaGray10
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
christinelarrosa
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
UiPathCommunity
 

Recently uploaded (20)

Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
 

Building A Scalable Open Source Storage Solution

  • 1. Building a scalable, open source storage and processing solution for biodiversity data Phil Cryer Anthony Goddard Thursday, November 12, 2009
  • 2. > Biodiversity Heritage Library's data • all BHL storage is handled by the Internet Archive • 38,000+ scanned books • approximately 48 terabytes of data • unable to self-host
  • 3. > BHL - Europe • 3 year, EU funded project • 28 major natural history museums, botanical gardens and other cooperating institutions • third file-store of all BHL data • collecting cultural heritage from all over Europe
  • 4. > Data explosion • more data being created • more data being saved • more data tomorrow • storage has not kept up with Moore’s Law • this presentation will be saved online, more data!
  • 7. > Problem 1 - Data access • file size we can’t store • latency of large files • quality user experience • processing data-mining
  • 8. > Problem 1 - Data access • file size we can’t store • latency of large files • quality user experience • processing data-mining • Access denied...
  • 9. > Problem 2 - Copyright concerns © • international copyright concerns • potential related funding issues • we’d rather not let this be an issue
  • 10. > Problem 3 - Redundancy
  • 11. > Problem 3 - Redundancy • computers crash • hard drives die • networks fail • natural disasters occur
  • 12. > Problem 3 - Redundancy • computers crash • hard drives die • networks fail • natural disasters occur but... This is NOT a problem!
  • 13. ...so plan for it.
  • 18. > Site 1 - Internet Archive
  • 19. > Site 2 - MBL, Woods Hole
  • 20. > Site 3 - NHM, London ...followed by new Data center
  • 21. Data Centre – “Darwin Repository” • €600,000 Funding secured from eContentPlus • Suitable location found with very good development potential in collaboration with Science Museum. • Economy of scale provides additional avenues for co-development of services. – Disaster Recovery and Business Continuity for all Museums (help with ongoing and running costs) • DCMS funding sought to help with development. – e-Infrastructure European initiative • Building Digital Repositories for Scientific Communities – PESI (Biodiversity)
  • 22. Proposed Data Centre Location Swindon Wroughton Science Museum ©2008 Google – Imagery ©2008 DigitalGlobe, Infoterra Ltd & Bluesky, GeoEye, Map data ©2008 Tele Atlas
  • 23. Vendor Stakeholders / Partners • Identified Technology Partners* • Additional Funding Partners* *Note: Discussions are ongoing with all Partners and may be at different stages
  • 24. Long Term Sustainability • No Dripping Tap – Business case should provide for significant self funding opportunities. • Diversity – Darwin Repository (Data Centre) will provide an economy of scale that will provide significant efficiency gains. • Green technology to minimise carbon footprint and provide industry leadership.
  • 25. > Distributed storage • write once, read anywhere • replication and fault tolerance • error correction • automatic redundancy • scalable horizontally
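
The replication and error-correction bullets on this slide can be illustrated with a minimal sketch. It assumes replicas are plain directories and that a SHA-256 checksum recorded at write time is the reference for detecting damage; the function names (`write_once`, `repair`, `demo`) and file names are hypothetical, and this is not how GlusterFS works internally.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_once(source: Path, replicas: list[Path]) -> str:
    """Copy source into every replica directory and return its checksum."""
    checksum = sha256_of(source)
    for replica in replicas:
        replica.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, replica / source.name)
    return checksum


def repair(name: str, checksum: str, replicas: list[Path]) -> int:
    """Replace missing or corrupt copies from an intact one; return the repair count."""
    good = next(
        (r / name for r in replicas
         if (r / name).exists() and sha256_of(r / name) == checksum),
        None,
    )
    if good is None:
        raise RuntimeError(f"no intact copy of {name} left")
    repaired = 0
    for replica in replicas:
        target = replica / name
        if not target.exists() or sha256_of(target) != checksum:
            shutil.copy2(good, target)
            repaired += 1
    return repaired


def demo() -> int:
    """Write one file to three replicas, corrupt one copy, then repair it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "book-0001.txt"  # hypothetical file name
        source.write_text("scanned page text")
        replicas = [root / f"site{i}" for i in (1, 2, 3)]
        checksum = write_once(source, replicas)
        (replicas[1] / source.name).write_text("bit rot")  # simulated failure
        return repair(source.name, checksum, replicas)
```

Calling `demo()` reports how many copies had to be restored. With three replicas, any single damaged copy can be rebuilt as long as one intact copy remains.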
  • 26. > Distributed storage - Options • fully hosted storage (cloud) • hosted with own storage (private cloud) • self hosted with proprietary hardware (Sun Thumper) • self hosted with commodity hardware
  • 27. > Distributed storage - GlusterFS • GlusterFS: a cluster file-system capable of scaling to several petabytes • open source software on commodity hardware • tunable performance • simple to install and manage • offers seamless expansion
  • 28. > Distributed storage - Archival • Fedora Commons is an open source repository • accounts for all changes, so built-in version control • provides disaster recovery • open standards to mesh with future file formats • provides open sharing services such as OAI-PMH
  • 29. > Distributed storage - Mirrored data • now we have redundancy • in fact, multiple redundant copies • provides fault tolerance • offers load balancing • gives us future geographical distribution
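
How mirrored copies give both load balancing and fault tolerance can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the `MirrorPool` class, the `site1`..`site3` names and the in-memory fetch functions are all hypothetical stand-ins for real mirror sites.

```python
import itertools
from typing import Callable

# A fetcher takes an object key and returns its bytes, or raises OSError.
Fetcher = Callable[[str], bytes]


class MirrorPool:
    """Round-robin reads across mirrors, skipping any mirror that fails."""

    def __init__(self, mirrors: dict[str, Fetcher]) -> None:
        self._mirrors = mirrors
        self._order = itertools.cycle(list(mirrors))

    def read(self, key: str) -> tuple[str, bytes]:
        # Try each mirror at most once per read, continuing the rotation.
        for _ in range(len(self._mirrors)):
            name = next(self._order)
            try:
                return name, self._mirrors[name](key)
            except OSError:
                continue  # this mirror is down; move on to the next one
        raise OSError(f"all mirrors failed for {key}")


def demo() -> list[str]:
    """Read the same object three times while one mirror is down."""
    store = {"book-0001": b"scanned page text"}  # hypothetical object

    def healthy(key: str) -> bytes:
        return store[key]

    def down(key: str) -> bytes:
        raise ConnectionError("simulated outage")  # subclass of OSError

    pool = MirrorPool({"site1": healthy, "site2": down, "site3": healthy})
    return [pool.read("book-0001")[0] for _ in range(3)]
```

In `demo()`, reads rotate across the healthy mirrors and silently skip the one that is down, so the caller sees neither the outage nor which site answered unless it asks.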
  • 30. > Now we have lots of computers...
  • 31. > Distributed processing • more abilities available than just storing data • with distributed storage comes distributed processing • distributed processing means faster answers • faster answers mean new questions • lather, rinse, repeat
  • 32. > Distributed processing • make your data more useful • image and OCR processing • distributed web services • identifier resolution pools • map/reduce frameworks • generate new visualizations, text mining, NLP
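
The map/reduce idea listed above can be shown in miniature. The sketch below counts candidate species names across OCR pages; it assumes a thread pool as a stand-in for cluster nodes and uses a deliberately naive "Genus species" regular expression, which is not TaxonFinder and will also match ordinary capitalized phrases such as "We found". All function names are hypothetical.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Naive pattern for "Genus species" pairs; an assumption for illustration only.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")


def map_page(text: str) -> Counter:
    """Map step: count candidate names on one OCR page."""
    return Counter(BINOMIAL.findall(text))


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts."""
    return left + right


def count_names(pages: list[str], workers: int = 4) -> Counter:
    """Run the map step in parallel, then fold the partial results together."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_page, pages))
    return reduce(reduce_counts, partials, Counter())
```

Because each page is mapped independently and the reduce step is a simple merge, the same shape carries over to frameworks that spread pages across many machines.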
  • 33. > Distributed processing [Diagram labels: Request, Cluster Site, Load Balancer, WebService, Cluster Node, TaxonFinder]
  • 34. > Some assembly required (optional) • our example uses new, faster commodity hardware • but it could run on any hardware that can run Linux • you could chain old "outdated" computers together • build your own cluster for next to nothing (host it in your basement) • solves some infrastructure funding issues • hardware vendor neutrality
  • 35. > Our proof of concept • we ran a six-box cluster to demonstrate GlusterFS • ran stock Debian GNU/Linux • simulated hardware failures • synced data with a remote cluster • ran map/reduce jobs • defined procedures, configurations and build scripts
  • 38. [Architecture diagram, layers: presentation (www, REST API, mod_glusterfs); processing (disco, Hadoop); metadata (Fedora Commons, Mulgara triplestore / RDF); sync support (rsync, BitTorrent, HTTP); file system (GlusterFS, Ext4 (exabyte), ‘Network RAID’); storage (raw disk array, commodity SATA controllers, commodity hosts, dedicated storage network)]
  • 39. > Distributed storage - Projected costs $246,000 Graph from Backblaze (http://www.backblaze.com)
  • 40. > Other avenues - Cloud pilot • BHL is participating in a pilot with New York Public Library and Duraspace • Duraspace would provide a link to cloud providers • pilot to show feasibility of hosting • testing use of image server, other services in the cloud • cloud could seed new clusters
  • 41. > Code (63 6f 64 65) • all of our code and configurations are open source • hosted on Google Code • get involved • join the mailing-lists • follow us on Twitter • ask questions, we'll help!
  • 42. > It’s your turn... • similar projects? • distributed services and processing? • where can this be best applied? • resilient services on top of storage • names processing? • LSID resolution pools? • image processing? • text-mining / NLP? • #biodiv webservices?
  • 43. Web: http://www.biodiversitylibrary.org/ • Code, Support: http://code.google.com/p/bhl-bits • Twitter: @BioDivLibrary (tag #bhl) • Phil Cryer, Missouri Botanical Garden, Biodiversity Heritage Library, phil.cryer@mobot.org, http://philcryer.com, @fak3r • Anthony Goddard, MBLWHOI Library, Biodiversity Heritage Library, agoddard@mbl.edu, http://anthonygoddard.com, @anthonygoddard