Experience with Ingestion of Large Collections
Stuart Kenny
Research IT
Trinity College Dublin
The Fairy Tales of Charles Perrault. Illustrated by Harry Clarke.
Intro. Thomas Bodkin. London: George G. Harrap, [1922].
Internet Archive version of a copy in the New York Public Library.
Web. 25 December 2012.
My what a big collection you have!
About DRI (https://repository.dri.ie/)
● DRI is an interactive trusted digital repository for contemporary and historical, social and cultural data held by Irish institutions
● RIA (lead), NUIM, TCD, DIT, NUIG, NCAD
● Partners: academic, cultural, social, government
Outline
• What’s our problem?
• Example collections
• Ingest solutions
• Current ingest process
• Possible future process
Ingesting Objects
• Ingest form
o Suitable for single objects/small collections
o Flat hierarchies
o Simple metadata standards
• Multiple standards
o e.g., MARC, EAD
o XML upload
• How to handle complex standards, many objects?
Example Collection: Clarke Stained Glass
• MODS metadata
• 10,025 objects
• 42 sub-collections
• 20,047 files, 2.82 TB
• Problems:
o Large number of objects
o Data transfer
Example Collection: TCD Children’s Books
• MARC metadata
• 207,889 objects
• 16 sub-collections
• Problems:
o Large number of objects
o Very slow to ingest
o Timeouts and errors
Example Collection: Kilkenny Design Workshop
• EAD metadata
• 2,040 objects
• 2,734 series/files
• 2,231 files, 1.2 GB
• Problems:
o Very complex metadata standard
o Hierarchical structure
EAD, and why I don’t quite hate it as much as I did...
• Single XML file upload
• Structure encoded in metadata
• URLs to files
• But
o One-shot ingest
o How to edit/update?
o Slow to ingest
o Requires a lot of resources
Sufia Batch Upload
• Add multiple files
• New work for each
• Metadata for each work
• How to handle multiple standards?
• Different metadata for each work?
Avalon Batch Ingest
• Ingest package
o Manifest file
o Plus content files
• Manifest file is spreadsheet
o Metadata for items
o Names of content files
• Ingest package uploaded to Avalon DropBox
Approach up to now
• Command line client
o Enter text commands at ‘command prompt’
• Written in Ruby
• Run locally by user
• Metadata and asset files arranged in fixed directory structure
• Client iterates over the directory and creates each object as a single ingest
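The loop described above can be sketched as follows. This is an illustration only: the original client was written in Ruby, and the directory layout used here (one folder per object holding `metadata.xml` and an optional `assets/` folder), the function names, and the `ingest_one` callback are all hypothetical, not taken from the DRI client.

```python
from pathlib import Path
from typing import Callable, Iterator, List, Tuple

# Hypothetical layout (an assumption, not the DRI client's real structure):
#   <root>/<object_dir>/metadata.xml
#   <root>/<object_dir>/assets/<files...>


def discover_objects(root: Path) -> Iterator[Tuple[Path, List[Path]]]:
    """Yield (metadata_file, asset_files) for each object directory under root."""
    for object_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        metadata = object_dir / "metadata.xml"
        if not metadata.is_file():
            continue  # skip directories that do not look like objects
        assets_dir = object_dir / "assets"
        assets: List[Path] = []
        if assets_dir.is_dir():
            assets = sorted(p for p in assets_dir.iterdir() if p.is_file())
        yield metadata, assets


def ingest_directory(
    root: Path,
    ingest_one: Callable[[Path, List[Path]], None],
) -> Tuple[int, List[Tuple[Path, str]]]:
    """Ingest every object separately; return a success count and the failures."""
    succeeded = 0
    failures: List[Tuple[Path, str]] = []
    for metadata, assets in discover_objects(root):
        try:
            ingest_one(metadata, assets)  # one repository request per object
            succeeded += 1
        except Exception as exc:
            # Record the error and keep going so one bad object
            # does not stop the whole run.
            failures.append((metadata, str(exc)))
    return succeeded, failures
```

Because each object is a separate request, a collection with hundreds of thousands of objects means hundreds of thousands of round trips, each of which can fail independently.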
Problems
• Lack of user familiarity with command line
• Multiple platform support
o i.e., Windows
• Difficulty of installing
• Multiple single ingests
o Slow
o Error prone
• Required lots of user support
• In the end, most ingests were performed by the dev team
Current Attempt
• Web-based UI
• Borrow heavily from Avalon approach
• Upload metadata XML plus assets to online storage
• Add manifest spreadsheet
o Each row contains path to metadata
o Paths to zero or more asset files
o Paths relative to online storage directory
• Backend processes manifest and ingests as background task
• UI updates status
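A minimal sketch of reading such a manifest, assuming it has been exported as CSV with the metadata path in the first column and asset paths in the remaining columns. The column order, the absence of a header row, and the names `ManifestRow` and `read_manifest` are assumptions for illustration, not DRI's actual format or code.

```python
import csv
import io
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, TextIO


@dataclass
class ManifestRow:
    metadata: PurePosixPath
    assets: List[PurePosixPath]


def _resolve(storage_root: PurePosixPath, relative: str) -> PurePosixPath:
    """Join a manifest path onto the storage directory, rejecting escapes."""
    path = PurePosixPath(relative)
    if path.is_absolute() or ".." in path.parts:
        raise ValueError(f"path must stay inside the storage directory: {relative}")
    return storage_root / path


def read_manifest(handle: TextIO, storage_root: str) -> List[ManifestRow]:
    """Parse rows of: metadata path, then zero or more asset paths."""
    root = PurePosixPath(storage_root)
    rows: List[ManifestRow] = []
    for cells in csv.reader(handle):
        cells = [c.strip() for c in cells]
        if not cells or not cells[0]:
            continue  # skip blank rows
        metadata = _resolve(root, cells[0])
        assets = [_resolve(root, c) for c in cells[1:] if c]
        rows.append(ManifestRow(metadata, assets))
    return rows
```

For example, `read_manifest(io.StringIO("obj1/metadata.xml,obj1/a.tif\n"), "/uploads/user1")` yields one row whose paths are resolved under the hypothetical `/uploads/user1` directory.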
Current Attempt
[Diagram: UI, Online Storage, Repository. Steps: select manifest, retrieve remote files, ingest, update status.]
• Hydra BrowseEverything
o Gem to access cloud storage
o Dropbox, Google Drive…
• User uploads files
• In UI selects collection and manifest to ingest
• Everything handled server side in background
• Can view status in UI
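The background-task-plus-status pattern can be sketched as below. This is not DRI's implementation: a plain worker thread stands in for whatever job system the backend uses, and the `IngestJob` class, its status fields, and the `ingest_one` callback are hypothetical names chosen for illustration.

```python
import threading
from typing import Callable, Dict, List, Optional


class IngestJob:
    """Runs ingests on a worker thread and exposes progress for a UI to poll."""

    def __init__(self, rows: List[str], ingest_one: Callable[[str], None]) -> None:
        self._rows = rows
        self._ingest_one = ingest_one
        self._lock = threading.Lock()
        self._status: Dict[str, object] = {
            "total": len(rows), "done": 0, "failed": 0, "state": "pending",
        }
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        """Mark the job as running and return immediately."""
        with self._lock:
            self._status["state"] = "running"
        self._thread.start()

    def _run(self) -> None:
        for row in self._rows:
            try:
                self._ingest_one(row)
                key = "done"
            except Exception:
                key = "failed"  # count the failure and continue with the next row
            with self._lock:
                self._status[key] += 1
        with self._lock:
            self._status["state"] = "finished"

    def status(self) -> Dict[str, object]:
        """Return a snapshot the UI can display."""
        with self._lock:
            return dict(self._status)

    def wait(self, timeout: Optional[float] = None) -> None:
        """Block until the worker finishes (useful in tests and scripts)."""
        self._thread.join(timeout)
```

The web request that starts the job returns at once; the UI then polls `status()` instead of holding a connection open for the whole ingest, which is one way to avoid request timeouts on long ingests.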
Outstanding Issues
• Online storage
o Dropbox-type storage has size limits
• Creating a spreadsheet is harder than arranging a directory structure
• Possible solutions
o Provide online storage
o Has to be per user
o Generate required manifest from uploaded directory structure
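Generating the manifest from an uploaded directory could look roughly like this. It is a sketch under stated assumptions: each object lives in its own folder containing a `metadata.xml` file, every other file under that folder is treated as an asset, and the manifest is written as CSV. The layout and the `write_manifest` name are hypothetical.

```python
import csv
import io
from pathlib import Path
from typing import TextIO


def write_manifest(upload_root: Path, out: TextIO) -> int:
    """Write one CSV row per object folder; return the number of rows written.

    Each row holds the metadata path followed by zero or more asset paths,
    all relative to upload_root and using forward slashes.
    """
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for object_dir in sorted(p for p in upload_root.iterdir() if p.is_dir()):
        metadata = object_dir / "metadata.xml"  # assumed file name
        if not metadata.is_file():
            continue  # not an object folder
        assets = sorted(
            p for p in object_dir.rglob("*") if p.is_file() and p != metadata
        )
        row = [p.relative_to(upload_root).as_posix() for p in [metadata, *assets]]
        writer.writerow(row)
        count += 1
    return count
```

With this approach users keep arranging files in folders, as they did for the command line client, and the spreadsheet becomes a generated artefact rather than something they create by hand.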

Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collections at DRI
