
Updates on the BHL Global Cluster


Since last year’s TDWG we’ve taken this talk on the road. Here’s an update on the BHL global cluster for those who were here last year, and an introduction for those who weren’t. We'll talk about the reasons for needing the cluster, the concepts and software developed to support it (all of which is available as open source software) and, of course, the famous lessons learned.

NOTE: please click on the 'Notes' tab below the presentation for more detail on each slide.

Published in: Technology

Updates on the BHL Global Cluster

  1. Updates on the BHL Global Cluster
     biodiversity heritage library
     anthony goddard, phil cryer
  2. Us?
     o We do this talk a lot... generally our shirts match.
  3. What is the BHL?
     • BHL - The Biodiversity Heritage Library
       o digitization component of the Encyclopedia of Life
       o a consortium of global partners
       o aims to share historic biodiversity literature texts
       o provides open access to all content
       o free for all
  4. Why do we need a cluster?
     • All BHL data is at the Internet Archive in San Francisco
       o no redundancy
       o single point of failure (earthquake risk)
       o limited in how we could serve
       o no easy way to analyze data
     • First global BHL cluster gives us
       o redundancy
       o no single point of failure
       o various new serving options
       o new ways to run analytics
     #win!
  5. Use Linux and open source software running on commodity hardware to create a scalable, distributed filesystem.
  6. software
  7. hardware
  8. http://whbhl01.ubio.org/ganglia
  9. # ls -lh /mnt/glusterfs/www/a/actasocietatissc26suom
     total 649M
     -rwxr-xr-x 1 www-data www-data  19M 2009-07-10 01:55 actasocietatissc26suom_abbyy.gz
     -rwxr-xr-x 1 www-data www-data  28M 2009-07-10 06:53 actasocietatissc26suom_bw.pdf
     -rwxr-xr-x 1 www-data www-data 1.3K 2009-06-12 10:21 actasocietatissc26suom_dc.xml
     -rwxr-xr-x 1 www-data www-data  18M 2009-07-10 03:05 actasocietatissc26suom.djvu
     -rwxr-xr-x 1 www-data www-data 1.3M 2009-07-10 06:54 actasocietatissc26suom_djvu.txt
     -rwxr-xr-x 1 www-data www-data  14M 2009-07-10 02:08 actasocietatissc26suom_djvu.xml
     -rwxr-xr-x 1 www-data www-data 4.4K 2009-12-14 04:42 actasocietatissc26suom_files.xml
     -rwxr-xr-x 1 www-data www-data  20M 2009-07-09 18:57 actasocietatissc26suom_flippy.zip
     -rwxr-xr-x 1 www-data www-data 285K 2009-07-09 18:52 actasocietatissc26suom.gif
     -rwxr-xr-x 1 www-data www-data 193M 2009-07-09 18:51 actasocietatissc26suom_jp2.zip
     -rwxr-xr-x 1 www-data www-data 5.7K 2009-06-12 10:21 actasocietatissc26suom_marc.xml
     -rwxr-xr-x 1 www-data www-data 2.0K 2009-06-12 10:21 actasocietatissc26suom_meta.mrc
     -rwxr-xr-x 1 www-data www-data  416 2009-06-12 10:21 actasocietatissc26suom_metasource.xml
     -rwxr-xr-x 1 www-data www-data 2.2K 2009-12-01 12:20 actasocietatissc26suom_meta.xml
     -rwxr-xr-x 1 www-data www-data 279K 2009-12-14 04:42 actasocietatissc26suom_names.xml
     -rwxr-xr-x 1 www-data www-data 324M 2009-07-09 13:28 actasocietatissc26suom_orig_jp2.tar
     -rwxr-xr-x 1 www-data www-data  34M 2009-07-10 04:35 actasocietatissc26suom.pdf
     -rwxr-xr-x 1 www-data www-data 365K 2009-07-09 13:28 actasocietatissc26suom_scandata.xml
  10. initial population
  11. the plan
     • Internet2 - woohoo
       o “This will take forever” (it took longer)
       o “We need more space” (not 24TB)
       o “something’s overloading the network” (oops)
       o “this checksum is wrong” (what the...)
     • Lessons learned: would we do it again? Probably not.
  12. code: grabbyd
     Diagram labels: 1; Internet Archive, San Francisco; BHL Global, Woods Hole
  13. code: grabbyd_reporting
     http://cluster.biodiversitylibrary.org/
  14. code: bhl-sync
     Open source Dropbox model
     • inotify
     • lsyncd
     • OpenSSH
     • rsync
  15. all of our created code is open sourced and available at bit.ly/bhl-bits
  16. http://bit.ly/09-bhl-sync
  17. Replication | Replication
  18. BHL content distribution
     Diagram nodes: Internet Archive, San Francisco; BHL Global, Woods Hole; BHL China, Beijing; BHL, St. Louis; BHL Europe, London; BHL Australia, Melbourne
     Diagram link labels: 1, ?, 2, 2, ?
  19. BHL content + local data
     Diagram nodes: Internet Archive, San Francisco; BHL Global, Woods Hole; BHL China, Beijing
     Content sourced from China, scanned by Internet Archive, replicated into BHL Global
  20. BHL content + regional data
     Diagram nodes: Internet Archive, San Francisco; BHL Global, Woods Hole; BHL Europe, Paris; BHL Europe, London; BHL Europe, Berlin
     Diagram link labels: ?
     Content sourced from BHL Europe partners may, or may not, be passed back to Internet Archive and BHL Global
  21. other replication challenges
     • deleting content - "going dark"
     • new content coming in from other sources (localization of content)
     • distributing modified content
  22. fedora-commons integration
     Repository platform
     • storage, access and management of digital content
     • a base for software developers to build tools for sharing
     • free, community supported, open source software
  23. fedora-commons integration
     Repository platform
     • storage, access and management of digital content
     • a base for software developers to build tools for sharing
     • free, community supported, open source software
     • Maintains a persistent, stable, digital archive
       o provides backup, redundancy and disaster recovery
       o complements existing architecture by incorporating open standards
       o stores data in a neutral manner
       o shares data via OAI
  24. BHL content distribution
     Diagram nodes: Internet Archive, San Francisco; BHL Global, Woods Hole; Fedora-commons; BHL, St. Louis; BHL Europe, London
  25. BHL content distribution
     Diagram nodes: Internet Archive, San Francisco; BHL Global, Woods Hole; Fedora-commons; OAI; BHL node; Fedora-commons
  26. BHL content distribution
     Diagram nodes: Internet Archive, San Francisco; BHL Global, Woods Hole; Fedora-commons; OAI; BHL node; Fedora-commons
  27. computational services
  28. thanks.
     anthony goddard, phil cryer
     all code available bit.ly/bhl-bits
     presentation slides on slidesha.re/bhl-slides
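
Slide 11 mentions a checksum that came out wrong during the initial population, and the listing on slide 9 shows a `_files.xml` manifest sitting next to each item's files. As a minimal sketch of how a downloaded item could be checked (this is not the authors' grabbyd code), the Python below compares local files against MD5 values listed in a manifest. The manifest layout used here (`<file name="...">` elements with an `<md5>` child) is an assumption made for illustration, and `md5_of`, `verify_item`, and `demo` are hypothetical names.

```python
import hashlib
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large scans do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_item(item_dir, manifest_xml):
    """Return {file name: 'ok' | 'mismatch' | 'missing'} for each manifest entry.

    Assumed (hypothetical) manifest shape:
    <files><file name="x"><md5>...</md5></file>...</files>
    """
    results = {}
    for entry in ET.fromstring(manifest_xml).iter("file"):
        name = entry.get("name")
        expected = entry.findtext("md5")
        if name is None or expected is None:
            continue  # entry carries no checksum, nothing to compare
        local = Path(item_dir) / name
        if not local.is_file():
            results[name] = "missing"
        elif md5_of(local) == expected.strip().lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


def demo():
    """Build a throwaway item directory and verify it against a small manifest."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "example_dc.xml").write_bytes(b"<dc/>")
        (Path(tmp) / "example.djvu").write_bytes(b"truncated download")
        good_md5 = hashlib.md5(b"<dc/>").hexdigest()
        manifest = (
            "<files>"
            f'<file name="example_dc.xml"><md5>{good_md5}</md5></file>'
            f'<file name="example.djvu"><md5>{"0" * 32}</md5></file>'
            f'<file name="example.pdf"><md5>{"0" * 32}</md5></file>'
            "</files>"
        )
        return verify_item(tmp, manifest)
```

Calling `demo()` reports one file as `ok`, one as `mismatch`, and one as `missing`. In a real run, anything not marked `ok` would be a candidate for re-download.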

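
Slide 14 lists the pieces behind bhl-sync: inotify, lsyncd, OpenSSH, and rsync. The slides do not show how they are wired together, so the following is only a sketch of the last step under stated assumptions: once a change has been noticed, push a directory to a remote node with rsync over SSH. The function names, host, and paths are hypothetical, and this is not the bhl-sync code itself. Deletion is off by default, since slide 21 lists deleting content as an open replication challenge.

```python
import subprocess


def build_rsync_command(source_dir, remote_host, remote_dir, delete=False):
    """Build an rsync-over-SSH command line as an argument list."""
    command = ["rsync", "--archive", "--compress", "--rsh=ssh"]
    if delete:
        # Also remove files on the remote side that no longer exist locally.
        command.append("--delete")
    # A trailing slash makes rsync copy the directory's contents,
    # not the directory itself.
    command.append(source_dir.rstrip("/") + "/")
    command.append(f"{remote_host}:{remote_dir}")
    return command


def sync(source_dir, remote_host, remote_dir, delete=False, dry_run=True):
    """Return the command; run it only when dry_run is False."""
    command = build_rsync_command(source_dir, remote_host, remote_dir, delete)
    if not dry_run:
        subprocess.run(command, check=True)
    return command
```

For example, `sync("/data/www", "mirror@example.org", "/data/www/")` returns the command without running it, which makes it easy to inspect what would be sent before enabling real transfers.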