Updates on the BHL Global Cluster
 


Since last year's TDWG we've taken this talk on the road. Here's an update on the BHL global cluster for those who were here last year, and an introduction for those who weren't. We'll talk about the reasons for needing the cluster, the concepts and software developed to support it (all of which is available as open source software) and, of course, the famous lessons learned.



  • PHIL/ANT Ohai! This is Phil, this is Ant. Since last year's TDWG we've taken this talk on the road; here's an update for those who were here last year, and an introduction for those who weren't.
  • ANT if anyone does not know, BHL is the digitization component of the Encyclopedia of Life, a consortium of global libraries and natural history museums; it aims to digitize and share historic biodiversity literature texts, open access and free for all
  • PHIL currently all BHL data is stored at IA, which is bad for many reasons - with our own cluster we gain control and new options for how we use and store our data
  • PHIL so how did we get from a proof of concept, put together from various pieces of outdated hardware
  • PHIL to our formal production cluster
  • PHIL with our first global cluster, our concept was to create a scalable storage system to store and serve our data, one that others *could* emulate, using open source software
  • PHIL our systems run Debian Linux with the ext4 filesystem, which supports far larger filesystems (up to 1 exabyte) and file sizes (up to 16TB). We use the GlusterFS distributed/networked filesystem to handle the replication
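As a rough sketch of that replication layer: a two-way mirrored GlusterFS volume is assembled from "bricks" on paired nodes, along these lines (the hostnames and brick paths here are hypothetical illustrations, not the actual BHL configuration):

```shell
# Create a 2-way replicated GlusterFS volume across six nodes (hypothetical names).
# Each consecutive pair of bricks forms a mirror; ext4 backs each brick.
gluster volume create bhl-vol replica 2 \
    node1:/export/brick1 node2:/export/brick1 \
    node3:/export/brick1 node4:/export/brick1 \
    node5:/export/brick1 node6:/export/brick1
gluster volume start bhl-vol

# Mount the aggregate volume; clients see one large filesystem.
mount -t glusterfs node1:/bhl-vol /mnt/glusterfs
```

With this layout, half the bricks mirror the other half, which is why usable capacity is half of raw.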
  • ANT the cluster contains 6 boxes like this, 24 hard drives per box, split up and mirrored via networked clustered storage. This gives us 216TB of raw space, or 108TB of usable, mirrored data.
  • ANT it's a cluster.. six boxes, but we see it as one giant machine: 64GB RAM, 100TB of disk, 48 processors
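The capacity figures above can be checked with a little arithmetic (the per-drive size is inferred from the totals quoted, so treat it as an approximation):

```shell
# 6 boxes x 24 drives = 144 spindles; 216TB raw implies ~1.5TB drives
awk 'BEGIN { printf "%.1f TB per drive\n", 216 / (6 * 24) }'

# 2-way mirroring halves raw capacity
echo "$((216 / 2)) TB usable"
```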
  • PHIL just an example of a record type that we store; all of the derivative files of a book can range anywhere from 200MB to over 3TB. Here's an average record at about 650MB, the size of a standard CD-ROM (our mirror currently has over 80k such records)
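As a back-of-envelope sanity check, 80k records at that average size lands in the same ballpark as the download totals mentioned in a moment:

```shell
# 80,000 records x ~650MB average = roughly 52TB of derivative files
awk 'BEGIN { printf "%.0f TB\n", 80000 * 650 / 1e6 }'
```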
  • ANT we looked at different ways of transferring the files from the Internet Archive to our own cluster
  • ANT after considering all the options, it was decided to download the data from IA
  • ANT this shows some of the downloading in progress (250MB/sec); all told we have downloaded 74TB so far - had some problems...
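At the quoted 250MB/sec, 74TB is only a few days of idealized wire time; the problems just mentioned are why the real transfer took far longer (pure arithmetic, ignoring all protocol and disk overhead):

```shell
# 74TB at a sustained 250MB/sec, expressed in days (idealized; no overhead)
awk 'BEGIN { printf "%.1f days\n", 74e12 / 250e6 / 86400 }'
```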
  • ANT talk about the 1st, 2nd and 3rd one; PHIL do the 4th, and the 'lessons learned'
  • PHIL parts of the code that did the initial download have been reworked into an ongoing process; grabbyd will handle downloading new items from IA to the cluster weekly
  • PHIL we currently have reporting that gives updates on download progress, the overall size of the data and transfer rates. This will be expanded as we go forward
  • PHIL to keep the various nodes in sync we've written a backend, open source, Dropbox-like server application. Using other software we can have a service listening for any changes and kicking off the syncing scripts
  • ANT all of the code we write is available as open source software, hosted on the BHL code repository
  • ANT we have begun initial speed and sync tests within the US and to London; work on these tests to Australia will start shortly
  • ANT the global aspect of BHL has become clearer after last week's global meeting, with Egypt and Brazil joining others like China, Australia, and Europe
  • ANT there are many options for syncing; given the degree of control we require, we chose to use IA as a point of data ingestion and Woods Hole as a master site to seed data from
  • ANT but, in the case of China, we ingest data into IA and then sync that data to our cluster in Woods Hole - so our model is flexible
  • ANT in the case of BHL-Europe, content may or may not be ingested via IA, depending on whether BHL-Europe wants to take advantage of IA services such as OCR
  • PHIL there are other challenges, such as deleting content or content "going dark", localization of content, and especially how to deal with modified or annotated content
  • PHIL to track changes to the content we're using Fedora-commons, which provides access to and management of digital content, and is a base on which to build other apps that use the data in other ways
  • PHIL Fedora maintains a persistent archive, used for backup and disaster recovery of the files; it complements the existing architecture by using open standards while requiring nothing of the existing system, and offers more sharing options via OAI
  • PHIL as seen in the mix, Fedora runs independently of the system
  • PHIL while it could provide a conduit to share metadata about the archive
  • PHIL and can even talk 1:1 with another Fedora instance
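Sharing metadata "via OAI" means serving standard OAI-PMH requests; a harvest against a Fedora-backed OAI provider would look roughly like this (the hostname and path are placeholders for illustration, not a real BHL endpoint):

```shell
# List Dublin Core metadata records from a (hypothetical) OAI-PMH provider
curl 'http://fedora.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc'

# Fetch a single record by identifier (identifier shown is made up)
curl 'http://fedora.example.org/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:example.org:demo:1'
```

`ListRecords`, `GetRecord` and the `oai_dc` metadata prefix are standard OAI-PMH; any conforming harvester can pull the archive's metadata this way.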
  • ANT we have this hardware, and we intend to make use of it for computational services such as taxon name finding and text mining. PHIL we have tested running Hadoop on our cluster, statistical jobs in R have been run in Missouri, and we're looking to integrate the cluster for this
  • PHIL/ANT in closing: while the BHL global cluster serves a specific purpose, we'd like to highlight that a similar cluster can be built by anyone, in many ways, and for almost no money - contact us for any advice or assistance with this. Thanks

Updates on the BHL Global Cluster - Presentation Transcript

  • Updates on the BHL Global Cluster biodiversity heritage library anthony goddard phil cryer
  • Us? o We do this talk a lot.. generally our shirts match.
  • What is the BHL? • BHL - The Biodiversity Heritage Library o digitization component of the Encyclopedia of Life o a consortium of global partners o aims to share historic biodiversity literature texts o provide open access to all content o free for all
  • Why do we need a cluster? • All BHL data is at the Internet Archive in San Francisco o no redundancy o single point of failure (earthquake risk) o limited in how we could serve o no easy way to analyze data • First global BHL cluster gives us o redundancy o no single point of failure o various new serving options o new ways to run analytics #win!
  • Use Linux and open source software running on commodity hardware to create a scalable, distributed filesystem.
  • software
  • hardware
  • http://whbhl01.ubio.org/ganglia
  • # ls -lh /mnt/glusterfs/www/a/actasocietatissc26suom
    total 649M
    -rwxr-xr-x 1 www-data www-data  19M 2009-07-10 01:55 actasocietatissc26suom_abbyy.gz
    -rwxr-xr-x 1 www-data www-data  28M 2009-07-10 06:53 actasocietatissc26suom_bw.pdf
    -rwxr-xr-x 1 www-data www-data 1.3K 2009-06-12 10:21 actasocietatissc26suom_dc.xml
    -rwxr-xr-x 1 www-data www-data  18M 2009-07-10 03:05 actasocietatissc26suom.djvu
    -rwxr-xr-x 1 www-data www-data 1.3M 2009-07-10 06:54 actasocietatissc26suom_djvu.txt
    -rwxr-xr-x 1 www-data www-data  14M 2009-07-10 02:08 actasocietatissc26suom_djvu.xml
    -rwxr-xr-x 1 www-data www-data 4.4K 2009-12-14 04:42 actasocietatissc26suom_files.xml
    -rwxr-xr-x 1 www-data www-data  20M 2009-07-09 18:57 actasocietatissc26suom_flippy.zip
    -rwxr-xr-x 1 www-data www-data 285K 2009-07-09 18:52 actasocietatissc26suom.gif
    -rwxr-xr-x 1 www-data www-data 193M 2009-07-09 18:51 actasocietatissc26suom_jp2.zip
    -rwxr-xr-x 1 www-data www-data 5.7K 2009-06-12 10:21 actasocietatissc26suom_marc.xml
    -rwxr-xr-x 1 www-data www-data 2.0K 2009-06-12 10:21 actasocietatissc26suom_meta.mrc
    -rwxr-xr-x 1 www-data www-data  416 2009-06-12 10:21 actasocietatissc26suom_metasource.xml
    -rwxr-xr-x 1 www-data www-data 2.2K 2009-12-01 12:20 actasocietatissc26suom_meta.xml
    -rwxr-xr-x 1 www-data www-data 279K 2009-12-14 04:42 actasocietatissc26suom_names.xml
    -rwxr-xr-x 1 www-data www-data 324M 2009-07-09 13:28 actasocietatissc26suom_orig_jp2.tar
    -rwxr-xr-x 1 www-data www-data  34M 2009-07-10 04:35 actasocietatissc26suom.pdf
    -rwxr-xr-x 1 www-data www-data 365K 2009-07-09 13:28 actasocietatissc26suom_scandata.xml
  • initial population
  • the plan • Internet2 - woohoo o “This will take forever” (it took longer) o “We need more space” (not 24TB) o “something’s overloading the network” (oops) o “this checksum is wrong” (what the...) • Lessons learned would we do it again? Probably not.
  • code: grabbyd 1 Internet Archive, San Francisco BHL Global, Woods Hole
  • code: grabbyd_reporting http://cluster.biodiversitylibrary.org/
  • code: bhl-sync Open source Dropbox model inotify lsyncd OpenSSH rsync
  • all of our created code is open sourced and available at bit.ly/bhl-bits
  • http://bit.ly/09-bhl-sync
  • Replication | Replication
  • BHL content distribution 1 ? Internet Archive, San Francisco BHL Global, Woods Hole BHL China, Beijing 2 2 ? BHL, St. Louis BHL Europe, London BHL Australia, Melbourne
  • BHL content + local data Internet Archive, San Francisco BHL Global, Woods Hole BHL China, Beijing Content sourced from China, scanned by Internet Archive, replicated into BHL Global
  • BHL content + regional data Internet Archive, San Francisco BHL Global, Woods Hole ? BHL Europe, Paris BHL Europe, London BHL Europe, Berlin Content sourced from BHL Europe partners may, or may not, be passed back to Internet Archive and BHL Global
  • other replication challenges • deleting content - "going dark" • new content coming in from other sources (localization of content) • distributing modified content 
  • fedora-commons integration Repository platform • storage, access and management digital content • a base for software developers to build tools for sharing • free, community supported, open source software
  • fedora-commons integration Repository platform • storage, access and management digital content • a base for software developers to build tools for sharing • free, community supported, open source software • Maintains a persistent, stable, digital archive o provides backup, redundancy and disaster recovery o complements existing architecture by incorporating open standards o stores data in a neutral manner o shares data via OAI
  • BHL content distribution Internet Archive, San Francisco BHL Global, Woods Hole Fedora-commons BHL, St. Louis BHL Europe, London
  • BHL content distribution Internet Archive, San Francisco BHL Global, Woods Hole Fedora-commons OAI BHL node Fedora-commons
  • BHL content distribution Internet Archive, San Francisco BHL Global, Woods Hole Fedora-commons OAI BHL node Fedora-commons
  • computational services
  • thanks. anthony goddard phil cryer all code available bit.ly/bhl-bits presentation slides on slidesha.re/bhl-slides