Good morning everybody. I want to switch gear slightly and tell you about some work I have been doing with my colleagues in Entomology to speed up the rate of digitisation. With the exception of the BHL project, what we have heard about so far are mostly small-scale projects digitising pockets of the NHM collection. These are mostly project-driven efforts, digitising on average a few thousand specimens. I’d argue that although these projects are useful, especially for people wanting these data, we need to take a more industrial approach to the problem of digitisation. And I come to this conclusion based on two observations.
The first is based on what last year’s House of Lords (HoL) Science and Technology Committee said about digitisation in their report on the state of taxonomy and systematics. They said, and this is a direct quote: “The rate of progress by the UK taxonomic institutions in digitising and making collections information available is disappointingly low … there is a significant risk of damage to the international reputation of major institutions such as The Natural History Museum.”
My second observation is that the HoL were absolutely right on both counts. Last year Graham Higley put together a cross-departmental group to look at digitisation efforts across the NHM. They gathered data from all the departments and looked at the rate at which metadata was being digitised (so that’s things like the collecting data on specimen labels), and the rate at which specimens were being digitised (in other words imaged) in various ways. From this they calculated that at present rates it would take 900 years to get the data off the collection, and 500 years to take the pictures. Now I don’t know about you, but if you believe that mass digitisation efforts are useful, and for reasons I’ll come to I’m one of those people that do, then I’m not prepared to wait that long. More to the point, I’m pretty certain our funders won’t, and even more importantly the people that might make use of this information (if they knew it was there) won’t wait either.
Perhaps one of the reasons why we are rather constrained in our thinking about digitisation is our natural focus on specimens. The sheer magnitude and effort of individually handling the 70 million plus specimens in order to digitise them is enough to put anyone off, especially when we have so many other priorities, and when we are not entirely clear why we’d undertake this in the first place. But the truth is that most of our specimens are grouped in a way that makes them much easier to handle, and in such a way that they are on display. For example, in entomology, although we have 28 million specimens, most of them are held in drawers, and we only have 135,000 of those. If there was a way in which we could digitise these drawers, and if we could get sufficient information not only to see the specimens but also perhaps to get taxonomic data from them, then perhaps the task of digitising the collection wouldn’t seem so great.
What I want to tell you about is a piece of prototype equipment we have been testing in the Sackler Image Lab for the past couple of months that will do just this. It is produced by a company called SmartDrive, based in Cambridge, and is a combination of hardware and software that provides automated capture of lower-resolution images, which are then assembled (stitched) into a larger panoramic image, generating an extremely high-resolution final image. A camera with an attached telecentric lens is moved in two dimensions along precision rails positioned above the imaged object. This method maximises the depth of field of the captured images and minimises distortion and parallax artefacts. The best way of understanding this is not by me explaining it but by you seeing it in action, and I have a short movie that demonstrates this.
This is the equipment based in the Sackler lab, and what Natalie is doing is placing a specimen drawer in the machine. These are some swallowtail butterflies, I think. She then sets the machine off from its starting position and it begins capturing images. What’s happening under the hood is that the camera and lens are moved along precision rails at the top, and at each point they capture an image. Each of the original images is 1280 x 960 pixels. The images are tiled together on the computer and that is what you can see on the screen. It takes about 5 minutes to do a typical-sized entomology drawer, although for some of the larger drawers it can take up to 7 minutes. The images are then stitched together by the computer, giving a final image of up to 21000 by 21000 pixels. That’s roughly 35 pixels per mm.
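As a rough check on the scale just quoted, the sketch below derives the pixel density and the footprint of a single pixel. It assumes that the 21000-pixel side of the largest image spans the roughly 600 mm maximum scan length mentioned later in the talk; that pairing is an assumption made for illustration, not something stated by the manufacturer.

```python
# Rough check of the quoted image scale (a sketch, not the vendor's calculation).
# Assumption: the 21000-pixel side of the largest stitched image spans the
# ~600 mm maximum scan length mentioned later in the talk.


def pixels_per_mm(image_pixels: int, scan_length_mm: float) -> float:
    """Return the sampling density along one axis in pixels per millimetre."""
    return image_pixels / scan_length_mm


def pixel_size_um(density_px_per_mm: float) -> float:
    """Return the width covered by one pixel, in micrometres."""
    return 1000.0 / density_px_per_mm


density = pixels_per_mm(21000, 600.0)
pixel_um = pixel_size_um(density)
print(f"{density:.0f} px/mm, {pixel_um:.1f} um per pixel")
```

On this assumption one pixel covers a little under 29 microns, so the finest structures quoted later for the open aperture (56 microns) are about two pixels across.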
Here are some example outputs from the machine. This close-up here corresponds to the tiny white patch on the wing. In actual fact the area is smaller than the white patch, but for some reason I couldn’t get PowerPoint to make a rectangle small enough. Just to be clear, the structures that you can see in this close-up are not pixels; these are individual scales on the butterfly wing. Of course butterflies are quite large, so let me show you some smaller objects. This is a drawer of fungus gnats. We have just mounted these images on the web using the Zoomify plug-in, which allows you to zoom in to very large images. As you can see, the images retain taxonomic information at the maximum zoom level and are still not pixelated. As a second example, here is a drawer of leaf-footed bugs (squash bugs). Again the images retain a high level of resolution and depth of field. In other words, the specimens and labels are in focus.
SmartDrive have lent us the SatScan equipment for about a month to run trials on various entomological, botanical and palaeoentomological parts of the collection. The goal of these trials was to assess utility for collection management and research, to work with the company to understand the technical and practical limitations of the machine, and to look at options for how it could be improved. In this short trial we managed to digitise about 500 drawers. Key facts from this work are that the minimum resolvable structure, depending on the precise aperture and exposure settings, is 0.06-0.1 mm. Just to put that into perspective, this means that about 65% of the specimens in the entomology collection here could be usefully digitised at this resolution. The system (again depending on the exact aperture and exposure settings) gives a very high depth of field, between 10 and 80 mm. In other words, objects like the specimen and the label (when it is not obscured) all stay in focus provided they sit within that range. The file sizes are actually relatively small when you consider the size of the drawer. They are about 300-500 MB as a compressed TIFF image, which sounds a lot but really isn’t too bad. Scanning time for a typical drawer is 5-7 minutes. This means that a single operator could do about 60 drawers a day. In addition you have to take into account the stitching time. This can take 5-10 minutes depending on the size of the drawer, meaning you can stitch images for about 90 drawers in a 12-hour period. However, this whole process can be batched so it runs overnight; there is no need to be present while it is happening.
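The throughput figures above can be sanity-checked with some simple arithmetic. The sketch below is illustrative only: it assumes the quoted times are per drawer and ignores handling time between drawers, which the talk does not quantify.

```python
# Back-of-envelope throughput from the trial figures (illustrative only).
# Assumption: the quoted times are per drawer and exclude handling time.


def scan_hours(drawers: int, minutes_per_drawer: float) -> float:
    """Hours of pure scanning needed for a number of drawers."""
    return drawers * minutes_per_drawer / 60


def drawers_per_period(period_hours: float, minutes_per_drawer: float) -> int:
    """Whole drawers that fit into a period at a fixed time per drawer."""
    return int(period_hours * 60 // minutes_per_drawer)


# Pure scanning time for the quoted 60 drawers a day at 5 and 7 minutes each.
fast_day = scan_hours(60, 5)
slow_day = scan_hours(60, 7)

# Stitching capacity of a 12-hour batch at the quoted 5 and 10 minutes each.
stitch_min = drawers_per_period(12, 10)
stitch_max = drawers_per_period(12, 5)

print(fast_day, slow_day, stitch_min, stitch_max)
```

Sixty drawers a day corresponds to five to seven hours of pure scanning, and the quoted 90 stitched drawers in 12 hours sits inside the 72 to 144 range, equivalent to an average of 8 minutes per drawer.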
This slide gives an overview of the relationship between the size of the aperture and the exposure time, which affect the depth of field and the size of the smallest resolvable structure. To save time I’m not going to go into detail here; suffice to say that with this camera combination you can resolve structures down to about 56 microns (that’s 56 thousandths of a millimetre), but you only get a depth of field of 6 mm. [Any photographer will know that there is a trade-off between the size of the aperture on the lens and the length of exposure on the camera, which affects the depth of field and the level of resolution. The wider the aperture, the shorter the exposure. This narrows the depth of field but increases the resolution. In this case the smallest structure we could resolve was 56 microns, but this gives us a depth of field of just 6 mm. At the other end, if you close the aperture down and use a longer exposure, we could achieve 8 cm of depth of field but resolve less on those specimens. The implication of this is that if you have a tray of very small insects and you want to achieve a high resolution, they all need to be within the 6 mm depth of field. Conversely, if you had larger insects in the tray and could tolerate a lower level of resolution, the tolerance on the depth of field is much higher (8 cm). We used a Basler 1/2" CCD chip camera and an Edmund Optics 0.16x telecentric lens.]
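To make the trade-off concrete, here is a minimal sketch that picks an aperture setting for a tray. The three settings are the values shown on the trial slide; the selection rule itself (take the shortest exposure that keeps the whole tray in focus and resolves the feature of interest) and the function name `choose_setting` are hypothetical and only meant to illustrate the reasoning.

```python
from typing import NamedTuple, Optional


class Setting(NamedTuple):
    aperture: str
    exposure_ms: int
    depth_of_field_mm: int
    smallest_structure_um: int


# Values as shown on the trial slide for the three tested apertures.
SETTINGS = [
    Setting("open", 11, 6, 56),
    Setting("midway", 41, 17, 59),
    Setting("closed", 810, 80, 98),
]


def choose_setting(height_range_mm: float, feature_um: float) -> Optional[Setting]:
    """Hypothetical rule: the shortest exposure that still keeps the whole
    height range in focus and resolves features of the requested size."""
    candidates = [
        s for s in SETTINGS
        if s.depth_of_field_mm >= height_range_mm
        and s.smallest_structure_um <= feature_um
    ]
    if not candidates:
        return None  # no tested setting satisfies both constraints
    return min(candidates, key=lambda s: s.exposure_ms)


small_flat_tray = choose_setting(height_range_mm=5, feature_um=60)
deep_large_tray = choose_setting(height_range_mm=50, feature_um=100)
impossible_tray = choose_setting(height_range_mm=50, feature_um=60)
print(small_flat_tray, deep_large_tray, impossible_tray)
```

A flat tray of small insects ends up on the open aperture, a deep tray of large insects on the closed one, and a deep tray that also needs fine detail has no suitable setting among the three tested.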
What are the implications of all this? Well, at a general level, the system is best suited to drawers of numerous, uniformly positioned, medium-sized specimens. For example, it gives excellent results with large and medium-sized beetles, moths and butterflies. At this level of resolution sufficient information is usually preserved to allow identification, often to species level, for these specimens. Objects less than 10 mm could not be imaged so adequately, although such images could be used in other ways, and I’ll come on to possible uses in a moment. Another key point is that specimen labels and barcodes (when not obscured by the specimens) could be easily read from the digitised image. Within entomology this more specifically means that of the 135,000 drawers in the department, 85,000 could be usefully imaged at the current level of resolution with this system. This work could be completed in ~2024 person-days (ten person-years) using one system. It’s worth noting that other lens / camera options might be explored to image the remaining drawers at a higher level of resolution.
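The effort estimate can be unpacked with a short calculation. The sketch below works backwards from the quoted totals; the resulting figure of about 42 drawers per person-day is inferred from those totals and is not stated in the talk.

```python
import math

DRAWERS = 85_000      # drawers judged usefully imageable
PERSON_DAYS = 2_024   # quoted effort
PERSON_YEARS = 10     # quoted equivalent in person-years

# Working rate and working year implied by the quoted totals (inferred).
implied_rate = DRAWERS / PERSON_DAYS
implied_days_per_year = PERSON_DAYS / PERSON_YEARS


def person_days(drawers: int, drawers_per_day: float) -> int:
    """Whole person-days needed at a given daily rate."""
    return math.ceil(drawers / drawers_per_day)


at_implied_rate = person_days(DRAWERS, 42)
at_sixty_a_day = person_days(DRAWERS, 60)
print(round(implied_rate), implied_days_per_year, at_implied_rate, at_sixty_a_day)
```

The quoted total implies a sustained rate somewhat below the 60 drawers a day seen for pure scanning in the trial, which leaves some allowance for handling; at a full 60 a day the same work would need roughly 1,400 person-days.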
It won’t have escaped your attention that there are some downsides to this approach. In fact I think there are three major issues we’d need to consider when evaluating its utility in the NHM. The first issue is metadata. This is such a big issue that I’ll consider it on a separate slide. The next major issue is the utility of surface (usually dorsal) view images, which are not a panacea. There are plenty of parts of our collection where having surface views of specimens simply isn’t that useful. For example, many mineralogical or palaeontological specimens have most of their information locked away inside them. Of course one might make the same point about the many other kinds of data (molecular, X-ray, chemical data) that are simply not accessible through images. The third major issue is that to make this process useful we would need to assign specimen-level identifiers to the objects we image. These can be physical labels, like barcodes, electronic labels actually on the images, or possibly both; I’ll cover this on the next slide. Another consideration is the space required to store all these images. If we are going to store 85k stitched images, that equals about 28,222 GB or 27.6 TB, which sounds a lot, but in this day and age really isn’t that much, especially when you consider the effort it represents. To make this system useful we need to make sure we develop the software to manage the workflow of processing the images. Likewise we have to integrate this with our existing systems like KE EMu and the DAMS system that Ailsa will talk about. Finally, and perhaps most importantly, if we are to embrace this process as part of our work, it has implications for the way we use the collection for research and collection management processes, particularly in terms of things like staff time and general curation activities. Another point, although it is actually pretty trivial when you consider the size of the other points, is the cost.
This is circa £50k (for outright purchase) or £2k per month to hire. There are also a few outstanding issues to do with the hardware and software of the system: the maximum scanning area is about 500 x 600 mm, which is insufficient for some drawers; there are occasional errors during scanning and stitching; focusing is currently time-consuming; and access to the scanning area is inconvenient.
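The storage estimate mentioned among the caveats (85k stitched images, about 28,222 GB or 27.6 TB) can be reproduced with a few lines. The sketch assumes the 340 MB per-image TIFF size shown on the "Key Facts" slide and binary multiples of 1024; both are assumptions that happen to match the quoted totals, not a documented derivation.

```python
import math
from typing import Tuple

IMAGES = 85_000       # drawers judged usefully imageable
MB_PER_IMAGE = 340    # TIFF size shown on the slide for a 15000 x 14000 image


def storage_estimate(images: int, mb_per_image: float) -> Tuple[float, float]:
    """Return (gigabytes, terabytes), assuming binary multiples of 1024."""
    total_mb = images * mb_per_image
    total_gb = total_mb / 1024
    return total_gb, total_gb / 1024


gb, tb = storage_estimate(IMAGES, MB_PER_IMAGE)
print(f"{math.floor(gb):,} GB or {tb:.1f} TB")
```

Larger drawers produce larger files (the talk quotes 300-500 MB), so the real requirement would depend on the mix of drawer sizes.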
I want to go back to this issue of metadata capture, since this is perhaps the most controversial point about this approach. My first point is that metadata capture is the rate-limiting step. If you remember, on the second slide I showed we established that at present rates it will take about 900 years to capture all the metadata from NHM specimens. This machine doesn’t directly change this fact. However I do want to make a few points about metadata capture that are important here. Firstly, specimen images and metadata need not be captured together. But if you don’t do it at the same time (and arguably even if you do), at the very least you need a way of linking them back together at a later date, and you do this linking through shared identifiers. In other words, the specimen and the image carry the same number so you can link the two back together. These specimen-level identifiers might be physical, virtual or both. Assignment of virtual identifiers might be automated (though this requires some investigation). More likely we would prioritise metadata capture based on research and collection activities. Images are easy to get with this system and we can image and re-image as required. We might think about more innovative ways of capturing the metadata, assigning identifiers and cropping images, for example through crowdsourcing the problem, though I don’t have the time to go into this here.
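The idea of capturing images and metadata separately and joining them later on a shared identifier can be sketched as follows. Everything here is hypothetical: the identifier format, file names, field names and the functions `link_records` and `awaiting_metadata` are invented for illustration and do not describe KE EMu or any existing NHM system.

```python
from typing import Dict, List, Optional

# Hypothetical specimen-level identifiers, e.g. read from barcodes in the image.
images: Dict[str, str] = {
    "SPEC-0001": "drawer_0421_crop_01.tif",
    "SPEC-0002": "drawer_0421_crop_02.tif",
    "SPEC-0003": "drawer_0421_crop_03.tif",
}

# Metadata transcribed later, possibly by someone else, keyed by the same identifiers.
metadata: Dict[str, Dict[str, str]] = {
    "SPEC-0001": {"locality": "Kent", "year": "1923"},
    "SPEC-0003": {"locality": "Devon", "year": "1951"},
}


def link_records(
    images: Dict[str, str], metadata: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, Optional[object]]]:
    """Join images to metadata on the shared identifier; missing metadata stays None."""
    return {
        ident: {"image": image_file, "metadata": metadata.get(ident)}
        for ident, image_file in images.items()
    }


def awaiting_metadata(linked: Dict[str, Dict[str, Optional[object]]]) -> List[str]:
    """Identifiers that have an image but no metadata yet."""
    return sorted(i for i, rec in linked.items() if rec["metadata"] is None)


linked = link_records(images, metadata)
backlog = awaiting_metadata(linked)
print(backlog)
```

Because the image exists independently of the metadata, the backlog list is exactly what could be prioritised by research need or handed to a crowdsourcing effort.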
In fact the separation of capturing the metadata, assigning identifiers and imaging the specimens was exactly what we observed when we opened the system up for others to use. This is a volunteer working on the British Lichens collection; she is adding barcodes to the specimens, noting metadata about the drawer (not the individual specimens) and then imaging the lot, all very quickly.
So what are our next steps with this system? Well, our main goal is to set up a larger-scale project to address the NHM issues we might have about using it, and I have just repeated the key issues here. At that point I’ll stop, but before I do I want to acknowledge the help of SmartDrive Ltd (especially Mike Broderick & Dennis Murphy), and in particular their free loan of the system while we explore how we can make use of it. Without their help and their innovative work on the system, none of this would be possible. Thanks very much.
Scaling-up collections digitisation
Vincent S. Smith, Vladimir Blagoderov, Ian Kitching & Thomas Simonsen
“the rate of progress by the UK taxonomic institutions in digitising and making collections information available is disappointingly low … there is a significant risk of damage to the international reputation of major institutions such as The Natural History Museum”
House of Lords Science and Technology Committee, Report on Taxonomy and Systematics, 2009
Example outputs
Diptera: http://sciaroidea.info/node/44309
Coreidae: http://sciaroidea.info/node/44310
Sackler Lab Trials
Nine test projects over 1 month (ent., bot. & palaeoent.)
<ul>
<li>Assess utility for coll. management and research</li>
<li>Understand technical & practical limitations</li>
</ul>
Key Facts
<ul>
<li>Minimal resolved structures: 0.06 - 0.1 mm</li>
<li>Depth of field: 10 - 80 mm</li>
<li>File size (15000 x 14000): 340 MB (TIFF)</li>
<li>Scanning time (45 x 50 cm): 5-7 min, depending on exposure</li>
<li>Stitching time, 200-400 tiles: 5:30-9:30 min (batchable, overnight)</li>
</ul>
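Sackler Lab Trials: Aperture, Exposure, Depth of Field & Resolution
<table>
<tr><th>Aperture</th><th>Exposure (ms)</th><th>DoF (mm)</th><th>Smallest resolvable structure (µm)</th></tr>
<tr><td>Open</td><td>11</td><td>6</td><td>56</td></tr>
<tr><td>Midway</td><td>41</td><td>17</td><td>59</td></tr>
<tr><td>Closed</td><td>810</td><td>80</td><td>98</td></tr>
</table>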
Implications
General points
<ul>
<li>Best suited to drawers of numerous, uniformly positioned, med. size spec.</li>
<li>Excellent results with large and medium-size beetles, moths and butterflies</li>
<li>Sufficient information is usually preserved to allow id. for these specimens</li>
<li>Objects less than 10 mm could not be imaged so adequately</li>
<li>Such images could be used in other ways</li>
<li>Specimen labels and barcodes (when not obscured) could be easily read from the digitised image</li>
</ul>
Entomology dept.
<ul>
<li>Of the 135,000 drawers in Entom., 85,000 could be usefully imaged at the current level of resolution with this system</li>
<li>This work could be completed in ~2024 person-days (ten person-years) using one system</li>
<li>Other lens / camera options might be explored to image remaining drawers</li>
</ul>
Caveats
NHM Issues
<ul>
<li>Metadata</li>
<li>Utility of surface (usually dorsal) view images - not a panacea</li>
<li>Assigning specimen level identifiers (physical, virtual or both)</li>
<li>Image storage (85k stitched images = 28,222 GB or 27.6 TB)</li>
<li>Software workflow (managing identifiers, cropping etc)</li>
<li>Integration with existing systems (KE EMu and DAMS)</li>
<li>Challenges to research & collection management processes (e.g. staff time, curation activities)</li>
<li>Cost: Circa £50k (outright purchase) or £2k per month hire</li>
</ul>
Hardware / Software issues
<ul>
<li>Max. scanning area ~ 500 x 600 mm, insufficient for some drawers</li>
<li>Occasional errors during scanning and stitching</li>
<li>Focusing (currently time consuming)</li>
<li>Inconvenient access to scanning area</li>
</ul>
Metadata capture is rate limiting
<ul>
<li>Specimen images & metadata need not be captured together</li>
<li>Link back together through common identifiers</li>
<li>Specimen level identifiers can be physical, virtual or both</li>
<li>Assignment of virtual identifiers might be automated</li>
<li>Prioritise metadata capture on research & collection activities</li>
<li>Image and re-image as required</li>
<li>Crowd source metadata capture, assignment of identifiers and image cropping</li>
</ul>
Possible Applications
Research
<ul>
<li>Acquiring images for use with automated identification software</li>
<li>Manual identifications</li>
<li>Morphometric analysis of specimens</li>
<li>Support the monitoring of environmental change</li>
<li>Supporting biodiversity conservation research</li>
<li>Studies on colour pattern variations</li>
</ul>
Collection management
<ul>
<li>Accurate specimen counts for the entire collection</li>
<li>Collections audit and security</li>
<li>Improving accessibility to the entire collection</li>
<li>Saving curator & visitor time</li>
<li>Improving curation</li>
<li>Updating identifications (crowdsourcing possibilities)</li>
<li>Encouraging typification (discovery of unrecognized/unlabelled types)</li>
<li>Populating KE EMu</li>
</ul>
Public engagement
<ul>
<li>Visual & engaging equipment on display in Sackler Lab.</li>
<li>Innovating crowd sourcing possibilities with the public</li>
<li>Meets NHM strategic commitments on collection accessibility</li>
</ul>
Next Steps…
Larger Scale Project to address NHM Issues
<ul>
<li>Metadata</li>
<li>Utility of surface (usually dorsal) view images - not a panacea</li>
<li>Assigning specimen level identifiers (physical, virtual or both)</li>
<li>Image storage (85k stitched images = 28,222 GB or 27.6 TB)</li>
<li>Software workflow (managing identifiers, cropping etc)</li>
<li>Integration with existing systems (KE EMu and DAMS)</li>
<li>Challenges to research & collection management processes (e.g. staff time, curation activities)</li>
<li>Cost: Circa £50k (outright purchase) or £2k per month hire</li>
</ul>
Acknowledgements
<ul>
<li>SmartDrive Ltd (esp. Mike Broderick & Dennis Murphy)</li>
</ul>
http://sciaroidea.info/sites/sciaroidea.info/files/SatScanTrialReport.pdf