SIL rapid capture


More information than you require about the Smithsonian Libraries' mass digitization program. Presentation given to Smithsonian staff (and some others) for a day-long symposium on rapid capture methodology. Focuses on SIL's workflow for scanning books.


Speaker notes
  • History: scanning since 1999. We created “digital editions”: whole books delivered via the website, scanned using a BetterLight back, saved as TIFFs, converted to JPG. Stored on gold CDs! And Tivoli. Metadata was entered by cutting and pasting into spreadsheets. In the beginning, one HTML page per book page! Then database-driven pages. Each book scanned was a unique project; some projects had grant funding, some didn’t. The end result was not stored in a content or collections management system, just on the website.
  • To increase volume, you must standardize: what metadata is collected, and so on. Try to accommodate most things you’ll scan, but inevitably one size won’t fit all; figure out what you’re willing to compromise on and live with. Format-based means books one way, photos another, audio another. Automation for efficiency and speed; staffing for consistency, quality control, and speed. You don’t necessarily need one huge funding source, but you do need a stream of funding: more than project-based, but not necessarily the whole enchilada. Leverage that as proof of concept for funding for other parts of the collection, or for funding additional services/features. Overlapping grants, creative redeployment of existing resources, project-within-a-project funding.
  • SIL’s rapid capture methodology is based on one large project (BHL) and its needs; we then extend the model from there. It is just one way of approaching it. We had an initial grant for digitization, supplemented with two more; more will need to come. We use funding primarily for STAFF, then for vendor/outsourced scanning, then for equipment/software. The process has taken a couple of years to standardize, and we couldn’t have standardized and sped up the process without the automation.
  • The catalyst for our ramp-up came in 2008 (or thereabouts), when the Smithsonian Libraries and the Missouri Botanical Garden spearheaded the creation of BHL. The primary audience was the international taxonomic community, and we had plenty of relevant collections. We are primarily scanning from our natural history collections, as well as the Cullman rare book collections. Those make up only n% of the total SIL collections, but a significant percentage of our public domain holdings. The ramp-up was necessitated by the terms of the grant!
  • Over 14,500 items and 5.8 million images scanned since 2008, mostly via the Internet Archive (BHL only). Our other scanning projects since 2010: over 1,900 items and 600,000 pages. We ramped up VERY QUICKLY, sending 200 items a week for scanning. We needed to spend out funds, but quality suffered: shipments started failing QC, so we scaled back. Fewer problems now. Rapidity is a function of non-destructive scanning and care with fragile/rare material. QC TAKES A LONG TIME, but saves rescanning later. Averaging ~4,000 images/month locally; IA averages 104,000 images/month. Storage at IA (est. 600MB per package of zipped, lossy-compressed JP2s, etc.): over 10TB (8.3TB BHL + 1.2TB? SI). Storage locally since 2011: average package size is 23.4GB, more than 4.5TB total; we save the TIFFs and JP2s.
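The storage figures above can be sanity-checked with a back-of-the-envelope calculation. This sketch uses the per-package sizes quoted in the talk (600MB at IA, 23.4GB locally); the item and package counts are illustrative assumptions, not figures from the presentation.

```python
# Back-of-the-envelope check of the storage estimates.
# Per-package sizes come from the talk; counts are assumed for illustration.

MB, GB, TB = 1, 1024, 1024 * 1024  # work in megabytes

# IA packages: zipped, lossy-compressed JP2s, ~600 MB each.
ia_items = 17_000                          # approx. books the grant covers (assumption)
ia_total_tb = ia_items * 600 * MB / TB
print(f"IA estimate: {ia_total_tb:.1f} TB")      # ~9.7 TB, the "over 10TB" order of magnitude

# Local packages: TIFFs + JP2s, avg. 23.4 GB each.
local_packages = 200                       # assumed count of locally scanned packages
local_total_tb = local_packages * 23.4 * GB / TB
print(f"Local estimate: {local_total_tb:.1f} TB")  # ~4.6 TB, matching "more than 4.5TB"
```

At these per-package sizes, local storage grows roughly 40 times faster per item than the IA-hosted derivatives, which is why local storage keeps coming up as an unsolved problem later in the talk.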
  • You don’t necessarily need one huge funding source, but you do need a stream of funding: overlapping grants, creative redeployment of existing resources, project-within-a-project funding. Initial BHL digitization costs were paid from the MacArthur grant to EOL/BHL, which only covers scanning: under $500,000 (will scan approx. 17,000 books, out of the more than 50,000 likely to be scanned for that project). A rough calculation figured the total cost to scan the entire (BHL) collection, by IA, which is cheap, at over $2.5 million. Personnel and equipment are funded from multi-year, overlapping Seidell grants ($1.5 million over 7 years). We are expanding scanning to other parts of the collection by setting aside special-purpose funds (director’s discretionary) for both people and scanning. The future? Gradually incorporate tasks into permanent staff duties, refill positions judiciously, and seek specific grants for special parts of the collection or special use cases.
  • The most important use of funds: full-time staff. Feed the beast. They manage and coordinate the workflow, and also do QC and post-scanning maintenance of the online collection. The BHL project is evolving and the workflow is more settled now, so we need librarians, not techs. Note that the librarians do more than manage the digitization: metadata issues are now usually routed through our contract cataloging process, which also uses grant funds.
  • IA: quick, cheap, open access. Downsides: size limit, public domain only, spotty quality. In-house: quality and control. Downsides: slower, more expensive, STORAGE. Speed may be less of a factor once the NEW CAMERA comes online.
  • Gory details: shoot a target at the beginning of the book only, and calibrate (mostly white balance) once per book. Always shoot at greater than 300ppi, relative to the size of the book. Shoot in 16-bit color, Adobe RGB (1998) color space. When images are converted to TIFF, downsample to 8-bit color and standardize on 300ppi (space issues). Apply auto-contrast and auto-levels but no other image editing in Capture One, maybe some sharpening if needed. Capture One does filenaming, crop, rotate, and conversion to TIFF. QC is done as a first pass right after scanning for all items, by the scanner operator. A second QC is done by other staff on a selected number of items, based on a formula (NISO standard!). QC looks only for “major” errors such as missing pages, thumbs in the picture, or cut-off text: anything that would adversely affect the OCR. We are concerned only with the “content,” since this is an ACCESS copy, not the book as ARTIFACT. After scanning, the operator manually moves the files onto the Macaw server, into the directory already created (the naming convention is the barcode, same as the filenames).
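The capture-versus-derivative policy above can be sketched as a small rule: capture at 16-bit, Adobe RGB (1998), at or above 300ppi, then standardize the access TIFF at 8-bit and 300ppi. This is a toy model of the policy, assuming hypothetical names; it is not Capture One's or Macaw's actual API.

```python
# Hedged sketch of the derivative policy: >=300ppi 16-bit capture,
# downsampled to a 300ppi 8-bit access TIFF. Names are illustrative.

from dataclasses import dataclass

@dataclass
class CaptureSettings:
    ppi: int            # resolution, chosen relative to the size of the book
    bit_depth: int      # bits per channel
    color_space: str

def derivative_settings(capture: CaptureSettings) -> CaptureSettings:
    """Return the access-TIFF settings for a given capture."""
    if capture.ppi < 300:
        raise ValueError("capture must be at or above 300ppi")
    # Standardize on 300ppi / 8-bit to save space; keep the color space.
    return CaptureSettings(ppi=300, bit_depth=8, color_space=capture.color_space)

raw = CaptureSettings(ppi=400, bit_depth=16, color_space="Adobe RGB (1998)")
tiff = derivative_settings(raw)
print(tiff)  # CaptureSettings(ppi=300, bit_depth=8, color_space='Adobe RGB (1998)')
```

The point of the one-way rule is that the space-saving happens only at derivative time: the capture settings are never compromised, so a better derivative can always be regenerated later.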
  • Digitization can happen anywhere: multiple vendors, in-house, legacy material you scanned way back, small grants, special projects, the main mass-digitization stream, extraction of pretty pictures for reuse. The bulk for us is done by IA (cost- and grant-driven for BHL), but they can’t do everything. All the various workflows = BLUE SPAGHETTI BARF. Hard to track, stuff everywhere, doesn’t scale (duh). We need to refine processes, standardize, and harmonize small-scale projects with the large-scale project.
  • Basic workflow. Key elements: an item-level metadata and workflow tracking database; SIRIS as the official metadata repository; IA as a staging (and temporary storage) area.
  • We use IA as staging for convenience: it was already used by the BHL project, there is plenty of storage space, they do OCR and create derivatives for us, and everything is available to everyone on IA. We accept a common basic metadata model (for the book format) based on the BHL/IA model; it suits most things. Still to solve: storage, presentation, and non-IA-compatible material (e.g., in-copyright items). However, creating metadata and uploading to IA would be a time-intensive manual process; to be efficient, we must AUTOMATE. Local scanning needed a tool to upload to IA and create metadata: Macaw.
  • Use and reuse the data you already have; find protocols to extract it. How? Through MACAW! For us, title-level MARC data comes from SIRIS via Z39.50. Item-level data is not as accessible, so we extracted it in bulk and stored it in a separate database that we use for workflow; Macaw then automatically harvests it from that database when necessary and transforms the harvested data to XML. Descriptive page-level data is still entered by hand, but technical metadata (image size) is extracted automatically, and the transformation to XML is automated. We also automate the TIFF-to-JP2 transformation and the bundling and uploading of locally created files to the IA staging area (easier said than done).
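The "harvest item-level data from the workflow database and transform it to XML" step can be sketched as follows. The table name, field names, and XML shape here are assumptions for illustration; Macaw's real schema and its BHL/IA-model output differ.

```python
# Sketch: harvest item-level rows from a workflow DB and emit simple XML.
# Schema and element names are assumed, not Macaw's actual ones.

import sqlite3
import xml.etree.ElementTree as ET

# Stand-in workflow database with a couple of item-level rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (barcode TEXT, volume TEXT, year TEXT)")
db.executemany("INSERT INTO items VALUES (?, ?, ?)",
               [("39088001234567", "v.1", "1901"),
                ("39088007654321", "v.2", "1902")])

def items_to_xml(conn: sqlite3.Connection) -> str:
    """Harvest item rows and serialize them as an XML document."""
    root = ET.Element("items")
    for barcode, volume, year in conn.execute(
            "SELECT barcode, volume, year FROM items"):
        item = ET.SubElement(root, "item", barcode=barcode)
        ET.SubElement(item, "volume").text = volume
        ET.SubElement(item, "year").text = year
    return ET.tostring(root, encoding="unicode")

xml_out = items_to_xml(db)
print(xml_out)
```

The same pattern (query once, transform mechanically, never re-key) is what makes the bulk extraction pay off: the data was typed once into SIRIS and is reused everywhere downstream.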
  • When a book is selected for scanning in the workflow database, Macaw (which checks it every couple of hours) imports the item-level data (barcode, volume, etc.) and creates a directory on its server to hold the metadata and scans. It then imports the MARC record from SIRIS via Z39.50 and converts it to MARCXML, saved in a file; the item-level data is stored in a database. When the scanner operator moves the scanned images into the directory, Macaw creates thumbnails for use in the interface.
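The polling step described above, checking the workflow database on a schedule and creating a per-item directory named by barcode, can be sketched like this. The schema, column names, and paths are assumptions; the MARC/Z39.50 import is omitted.

```python
# Sketch of Macaw's polling pass: find newly selected items in the
# workflow DB and create a barcode-named directory for each.
# Table/column names and paths are illustrative assumptions.

import sqlite3
import tempfile
from pathlib import Path

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE queue (barcode TEXT, selected INTEGER, imported INTEGER)")
db.execute("INSERT INTO queue VALUES ('39088001234567', 1, 0)")

server_root = Path(tempfile.mkdtemp())  # stand-in for the Macaw server

def poll_once(conn: sqlite3.Connection, root: Path) -> list[str]:
    """One polling pass: import selected-but-not-yet-imported items."""
    imported = []
    for (barcode,) in conn.execute(
            "SELECT barcode FROM queue WHERE selected=1 AND imported=0"):
        (root / barcode).mkdir(exist_ok=True)   # will hold metadata + scans
        conn.execute("UPDATE queue SET imported=1 WHERE barcode=?", (barcode,))
        imported.append(barcode)
    return imported

new_items = poll_once(db, server_root)
print(new_items)  # ['39088001234567']; a second pass returns []
```

Marking items as imported makes the pass idempotent, so a poll every couple of hours never re-creates or clobbers a directory a scan is already landing in.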
  • The operator scans the barcode for an item and is taken to the editing page, where they add page-level metadata (page type, page number) and structure (page sequence), stored in an XML file. It is an easy-to-use GUI with shortcuts for common operations, like selecting alternate pages to apply recto/verso and page-type descriptions. Pages can be re-ordered, which is especially useful if you’ve scanned all the rectos and then all the versos. It contains extra fields that we can use locally for other projects: captions, notes, a flag for “interestingness” (e.g., for a blog post). Once the book is “finished,” unless it is flagged for QC by other staff, Macaw creates the page-level XML file, converts the TIFFs to lossy compressed JP2s, zips the compressed JP2s, and sends the entire metadata-plus-scans package up to the Internet Archive, which is a lot easier to say than it is to do. The package is also copied locally to a NAS for temporary storage.
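The "select alternate pages" shortcut amounts to labeling every other page in the sequence. This is a toy model of that GUI operation with assumed field names, not Macaw's actual code.

```python
# Sketch of the recto/verso shortcut: alternate labels across a page
# sequence. Field names ("seq", "side") are illustrative assumptions.

def label_recto_verso(pages: list[dict], start: str = "recto") -> list[dict]:
    """Apply alternating recto/verso labels to a page sequence in place."""
    order = ("recto", "verso") if start == "recto" else ("verso", "recto")
    for i, page in enumerate(pages):
        page["side"] = order[i % 2]
    return pages

pages = [{"seq": n} for n in range(1, 5)]
label_recto_verso(pages)
print([p["side"] for p in pages])  # ['recto', 'verso', 'recto', 'verso']
```

One keyboard shortcut replacing a per-page dropdown is the kind of small automation that, multiplied over hundreds of pages per book, keeps the workflow "rapid."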
  • IA scanning is for “access” only, not preservation: managing expectations. Color and calibration issues. Current equipment is still slow to set up and to handle oversize materials. We are not embedding descriptive metadata in the page images; we need to automate this and send it to the DAMS or elsewhere.
  • Throughput: the new camera should help. Manuscripts and un-cataloged items need software tweaks for the metadata. We also need to develop automatic export to local storage, a.k.a. the DAMS. We are starting to repurpose images already (importing directly into our Galaxy of Images collection), but hope to integrate into an online exhibition workflow involving the DAMS and…? Who knows. Output to METS for storage; not thinking about PREMIS just yet. Islandora for storage and/or delivery of METS-based documents. We need to harvest scans and metadata back from multiple locations so we can manage corrections, storage (fault lines!), and possible replication of the BHL corpus. The interface will be part of the new digital library.
Transcript: SIL rapid capture

    1. “MORE”: More information on the SIL digitization program than you require. Keri Thompson, Smithsonian Institution Libraries. SPIN Rapid Capture Workshop, February 16, 2012
    2. Boutique Digitization: one-offs; item-based workflow; tailored metadata; hand-crafted data, much user intervention; opportunistic staffing; project-specific grants. [Illustration by A.E. Marty (1882-1974), Gazette du Bon Genre, July 1920, Smithsonian Institution Libraries]
    3. Mass Digitization: Prêt à lire. Standardization; format-based workflow and metadata model; automate as much as possible; assigned staff; funding stream. [New York Millinery and Supply Co., 1901, Smithsonian Institution Libraries]
    4. Ramping Up: find your niche; secure funding; hire staff; purchase equipment; standardize on metadata and processes; automate! (i.e., find a magic automation wizard)
    5. Our Little Corner of the Web: 10 original partner institutions; digitizing the legacy literature of taxonomy; over 50,000 titles, over 100,000 items, almost 38 million pages
    6. Numbers! Digitization at SI Libraries. [Chart: total items scanned per year, 1999-present (0 to 14,000 scale), with phases labeled “not rapid,” “rapid,” and “too rapid.” Storage estimates: at Internet Archive >10TB; locally >7.5TB]
    7. Funding: multiple grants, over multiple years; lather, rinse, repeat. [Kalamazoo Tank & Silo Co. Catalog, ca. 1909, Smithsonian Institution Libraries]
    8. Human Resources. Started in 2008 with: 2 FTE technicians (grant); .7 FTE manager; .5 FTE cataloger; vendor scanning only; and a host of others! In 2012 have: 1 FTE technician (grant); 2 FTE librarians (grant); .3 FTE manager; 1 scanning technician (grant); and a host of others! [International Time Recording Co., Time Recording Card Clocks, 1914, p.12, Smithsonian Institution Libraries]
    9. Equipment: Canon 5D MkII; PhaseOne P65, CaptureOne; Biblio BC100, CaptureOne
    10. In-House Scanning: P65, 60.5MP camera; strobe lights; image capture; filenaming; crop, rotate; no post-processing; convert to .tiff
    11. Process(es). [Diagram: data sources (vendor, in-house, “gap-fills,” special projects, requests, in-house use such as exhibitions and brochures) feeding into presentation (website) and storage]
    12. Workflow. [Diagram: generalized workflow. SIRIS supplies title-level MARC and item-level metadata to initiate the workflow; items are selected, checked out, deduped and shipped, scanned and QC’d, then checked in; the item is marked as scanned in the DB; URLs are added to the MARC record and a link is added in IA/BHL; JP2000s plus metadata are harvested from the Internet Archive to a local repository]
    13. Standardize Process and Data: common staging area; metadata model (title-level MARC metadata; item-level metadata: volume, issue, date, barcode; page-level metadata: sequence, page number, page type); common storage area; common presentation area. [Ericsson LM, Can Efficiency be Measured?, Stockholm, Sweden, 1946, Smithsonian Institution Libraries]
    14. Automate Metadata Capture & Transformation: extract title-level metadata (MARC to MARCXML); extract item-level metadata (from SIRIS to SQL db to XML file); page-level metadata (interface for easy data entry); file creation and conversion; upload to staging area. [National Cash Register Annual Report, 1953, Smithsonian Institution Libraries]
    15. Workflow. [Diagram: in-house workflow with Macaw. As in the generalized workflow, SIRIS supplies title-level MARC and item-level metadata, and items are selected, checked out, deduped and shipped, scanned and QC’d, checked in, and marked as scanned in the DB, with URLs added to the MARC record and a link added in IA/BHL. Macaw creates a metadata “bucket,” transforms images and creates derivatives from the .tiffs, adds page-level metadata, makes a temporary backup to NAS, and packages files (JP2000s plus metadata) for transfer to the Internet Archive]
    16. Metadata Collection and Workflow (Macaw)
    17. Room for Improvement: quality; speed; embed metadata. [Kenwood Bicycle Mfg. Co. Catalogue for 1895, 1895, Smithsonian Institution Libraries]
    18. Future: increase throughput; scan non-book items (MSS); scan un-cataloged items; frictionless repurposing; output to METS; Islandora; local delivery interface. [Collier’s, October 18, 1952, Smithsonian Institution Libraries]
    19. Thank You! THAT IS ALL. Keri @DigiKeri_SIL