Bodleian Library's DAMS system

Internal briefing on what the DAMS system is and isn't, and why certain choices were made in its design.

Usage Rights

CC Attribution License

  • Although the title of our talk is couched in terms of “preservation”, I’m going to switch terms and instead speak about what we now consider the more expansive and appropriate word, “curation”. Digital curation is the set of activities focused on maintaining and adding value to a body of trusted digital content. There are two key points here. First, the value of a curated digital asset is not fixed at the time of its creation. With the burgeoning phenomenon of social networking, value can be added to an asset over its lifetime. Second, curation encompasses both preservation and access. Although these are often considered disparate activities, we see them instead as complementary: preservation ensures access over time, while access depends on preservation up to a point in time.

Bodleian Library's DAMS system: Presentation Transcript

  • 'An institutional repository needs to be a service with continuity behind it........Institutions need to recognize that they are making commitments for the long term.' Clifford Lynch, 2004
  • Continuity and Services We anticipate that the content our systems hold now, or will hold, will outlive any software, hardware or even people involved in its creation.
  • Continuity and Services What content do we anticipate storing for the long term?
  • Continuity and Services
    • “Bookshelves for the 21st Century”
      • Legal Deposit
      • Voluntary Deposit
      • Digitisation
      • Extended Remit - Research Data
      • Administrative data – researchers, projects, places, dates
      • And anything else deemed worthy.
  • Continuity and Services What core principles do you design a system with when you know the content will outlive you?
  • Continuity and Services
    • Oxford DAMS – Digital Asset Management System
      • Not a single piece of software, more a set of guiding principles and aims applied to software and hardware.
      • The DAMS is the hardware environment (in my opinion)
  • Continuity and Services #1 Keep it simple – simple is maintainable, easy to understand.
  • Continuity and Services
    • #1 Keep it simple – simple is maintainable, easy to understand.
      • Think about how much we have achieved with the simple concept of using a single command on a simple protocol (HTTP GET) to get an HTML document. (HTML being a loose, ill-constrained set of elements.)
    “GET / HTTP/1.1”
  • Continuity and Services
    • #1 Keep it simple – simple is maintainable, easy to understand.
      • Simple protocols such as HTTP, simple formats such as JSON and API patterns such as REST are heavily favoured
      • Remember, at some point, administration and development of the system will need to be handed over (as I am doing now!).
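
To make the "HTTP GET plus a simple format" idea concrete, here is a minimal sketch using only the Python standard library. It is not taken from the DAMS itself: the record, its fields and the handler name are invented for illustration, and the server runs on a throwaway local port.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical record; the identifier and title are invented for illustration.
RECORD = {"id": "name:1", "title": "Example item"}


class RecordHandler(BaseHTTPRequestHandler):
    """Answers every GET with the same JSON document."""

    def do_GET(self):
        body = json.dumps(RECORD).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_record():
    """Start a local server, issue one GET request, and return (status, data)."""
    server = HTTPServer(("127.0.0.1", 0), RecordHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = HTTPConnection("127.0.0.1", server.server_address[1])
        conn.request("GET", "/")
        response = conn.getresponse()
        status = response.status
        data = json.loads(response.read().decode("utf-8"))
        conn.close()
        return status, data
    finally:
        server.shutdown()
        server.server_close()
```

Calling `fetch_record()` returns the HTTP status and the decoded JSON. The point is how little machinery is needed on either side of the exchange, which is what makes it easy to hand over.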
  • Continuity and Services #2 Everything but the content is replaceable
  • Continuity and Services
    • #2 Everything but the content is replaceable
      • The content that is being held by the system at any one time is what is important, not the surrounding infrastructure
      • The content should survive and be reusable in the event that services like databases, search indexes, etc crash or corrupt.
  • Continuity and Services 1st logical outcome: Do not use complex packages to hold your content! Minimise the amount of software and work that you need to maintain to understand how your content is stored.
  • Continuity and Services
    • 1st logical outcome:
      • Do not use complex packages to hold your content!
      • The basic, low-level digital object in our system is a “bag”
        • Each bag has a manifest, currently in either FOXML or RDF, which serves to identify the component parts and their inter-relationships.
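
As a rough illustration of a "bag", the sketch below treats it as a plain directory of files plus a manifest that names the parts. The slides say the manifest is FOXML or RDF but give no vocabulary, so this sketch assumes N-Triples (an RDF serialisation) with the Dublin Core `hasPart` term. The base URI, file names and function names are all hypothetical, and file names are assumed to be URI-safe.

```python
import tempfile
from pathlib import Path

HAS_PART = "http://purl.org/dc/terms/hasPart"
BASE = "http://example.org/bags/"  # hypothetical base URI, not a real DAMS address


def write_manifest(bag_dir: Path, bag_id: str) -> Path:
    """Write manifest.nt listing every other file in the bag as a part."""
    manifest = bag_dir / "manifest.nt"
    parts = sorted(
        p.relative_to(bag_dir).as_posix()
        for p in bag_dir.rglob("*")
        if p.is_file() and p != manifest
    )
    lines = [
        f"<{BASE}{bag_id}> <{HAS_PART}> <{BASE}{bag_id}/{part}> ."
        for part in parts
    ]
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def build_example_bag() -> str:
    """Create a throwaway bag with two parts and return its manifest text."""
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "letter-275"
        bag.mkdir()
        (bag / "transcript.txt").write_text("Dear Sir...", encoding="utf-8")
        (bag / "scan.tif").write_bytes(b"\x00")  # placeholder bytes, not a real image
        return write_manifest(bag, "letter-275").read_text(encoding="utf-8")
```

Because the bag is only files and a text manifest, it can be read with nothing more than a filesystem and a text editor, which is the property the slide is arguing for.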
  • Continuity and Services #3 The services in the system are replaceable and can be rebuilt from the content.
  • Continuity and Services
    • #3 The services in the system are replaceable and can be rebuilt from the content.
      • Services such as:
        • Search indexes (eg via Lucene/Solr)
        • Databases and RDF triplestores (MySQL, Mulgara, etc)
        • XML databases (eXist db, etc)
        • Object management middleware (Fedora)
        • Object registries
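
The principle that services can be rebuilt from the content can be sketched with a toy search index. The dictionary below stands in for the content store, and the in-memory inverted index stands in for something like Lucene/Solr. Both are simplifications for illustration, not the DAMS implementation.

```python
import re
from collections import defaultdict

# Stand-in for the content store: item id -> text. In a real store this
# would be read from the bags on disk; these records are invented.
CONTENT = {
    "name:1": "Letter from Oxford about the library",
    "name:2": "Library accession register",
}


def rebuild_index(content):
    """Derive a word -> item ids index purely from the stored content."""
    index = defaultdict(set)
    for item_id, text in content.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(item_id)
    return dict(index)


def search(index, word):
    """Return the ids of items containing the word, in sorted order."""
    return sorted(index.get(word.lower(), set()))
```

If the index is lost or corrupted, calling `rebuild_index(CONTENT)` again recreates it. Nothing of value lives only in the service.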
  • Continuity and Services
    • 2nd logical outcome:
      • A content store must be able to tell services when items have changed, been created, or deleted, and what items an individual store holds.
      • We need slightly smarter storage to do this over a network. Local filesystems have had this paradigm for a long time now, but we need to migrate it to the network level.
  • Continuity and Services
    • 2nd logical outcome:
      • A content store must be able to tell services when items have changed, been created, or deleted, and what items an individual store holds.
      • Each store and set of services needs a way to pass messages in a transactional way, via a messaging system.
        • This can be as simple as an HTTP call or as robust as an auditable message-queue transaction.
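
A minimal sketch of a store that announces its changes follows. It assumes an in-process `queue.Queue` as a stand-in for the messaging system, so it is neither networked nor transactional. The class names, the event shape and the "size of each item" view kept by the service are invented for illustration.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    action: str   # "created", "changed" or "deleted"
    item_id: str


class ContentStore:
    """Holds items and emits an event for every change."""

    def __init__(self, events):
        self._items = {}
        self._events = events

    def put(self, item_id, data):
        action = "changed" if item_id in self._items else "created"
        self._items[item_id] = data
        self._events.put(Event(action, item_id))

    def delete(self, item_id):
        del self._items[item_id]
        self._events.put(Event("deleted", item_id))

    def get(self, item_id):
        return self._items.get(item_id)

    def list_items(self):
        """Answer the question 'what items does this store hold?'."""
        return sorted(self._items)


class SizeService:
    """A replaceable service: tracks the size of each item from events alone."""

    def __init__(self, store):
        self.store = store
        self.known = {}

    def handle(self, event):
        data = self.store.get(event.item_id)
        if event.action == "deleted" or data is None:
            self.known.pop(event.item_id, None)
        else:
            self.known[event.item_id] = len(data)


def drain(events, service):
    """Deliver every queued event to the service."""
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break
        service.handle(event)
```

The service never scans the store. It reacts to events and can ask the store for its full item list when it needs to resynchronise.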
  • Continuity and Services
    • 3rd logical outcome:
      • The storage must store the information and documentation for the standards and conventions its content follows.
      • The storage must be able to describe itself.
  • Continuity and Services #4 We must lower our dependence on any one node in the system – be it hardware, software or even a person.
  • Continuity and Services
    • #4 We must lower our dependence on any one node in the system – be it hardware, software or even a person.
      • This is another reason for #1 – keeping things simple.
        • Lowers the cost of hardware replacement and upgrade – simple hardware tends to be cheaper and more readily available; lowering the number of crucial features broadens what can be used.
        • Software – Oxford has been around for 1000 years. Vendors will not be.
        • People – a simpler system is more readily understood by new/replacement engineers.
  • Challenges to maintaining Continuity
    • Flexibility
    • Scalability
    • Longevity
    • Availability
    • Sustainability
    • Interoperability
  • Challenges to maintaining Continuity
    • Flexibility
      • Open standards and open APIs allow us to provide the tools people need to create the experience with the content they want.
      • 'Text' based metadata – XML, RDF, RDFa, JSON
      • Componentised web services providing open APIs – flexibility through dynamically selecting what services run at any time.
      • Open Source
  • Challenges to maintaining Continuity
    • Scalability
      • We need to anticipate exponential growth in demand, with the knowledge that storage will be long-term
      • A rough estimate of all the data produced in the world every second is 7km of CDROMs – broadcast video, big science, CCTVs, cameras, research, etc. This figure is only ever going to go up.
      • To scale like the web, we have to be like the web; not one single black box and workflow, but a distributed net of storage and services, simply interlinked.
      • The “Billion file [inode] problem” must be avoided
  • Challenges to maintaining Continuity
    • Longevity
      • Live Replicas (Backup Problems)
        • MAID (Power)
      • Self-healing Systems (Resilience)
      • Simplicity of Interfaces
      • Avoid 3rd Party dependence
      • Support Heterogeneity
        • Resolvers/Abstraction Layers
  • Challenges to maintaining Continuity
    • Availability
      • Basic IT availability
      • Enhanced long-term availability
      • Archival recoverability
      • Digital Preservation
        • Conversion
        • Emulation
        • Archive Preservation
  • Challenges to maintaining Continuity
    • Sustainability
      • Budget and cost as a conventional library
      • Factor archival costs into projects
      • Leverage content to generate income
      • Migrate skills
  • Challenges to maintaining Continuity
    • Interoperability
      • Interoperability is an ongoing process
        • Support for emerging and established standards
      • Persistent, stable, well-defined interfaces
      • Ideally implement interfaces bidirectionally
      • Avoid low-level interfaces – abstract as much as feasible
      • Embrace the web – if you do it right, people will reuse your content in ways you never thought of and never thought possible.
  • DAMS overview
    • The following schematics might not reflect the absolute current state, but they give the right idea of the direction of growth of the DAMS.
  • Phase 1 – current hardware [diagram labels: VMWare ESX; VM Image Store (2TB); HC 1; HC 2; MDICS]
  • Phase 1 – current hardware: Content layer [diagram labels: VMWare ESX; VM Image Store (2TB); HC 1; HC 2; MDICS]
  • Phase 1 – current hardware: Service storage layer [diagram labels: VMWare ESX; VM Image Store (2TB); HC 1; HC 2; MDICS]
  • Phase 1 – current hardware: Service execution layer [diagram labels: VMWare ESX; VM Image Store (2TB); HC 1; HC 2; MDICS]
  • Phase 2 – (expecting hardware delivery this month) [diagram labels: VMWare ESX; VMWare ESX; VM Image Store (2TB); VM Image Store (2TB); HC 1; HC 2; MDICS]
  • Aim – multisite (add sites as required to scale up) [diagram labels: VMWare ESX; VMWare ESX; VM Image Store (2TB); VM Image Store (2TB); HC 1; Thumper 1; Thumper 2; HC 2; Sun Rays; FutureArch; MDICS; RSL; Osney?]
  • No Monolith? Chaos?
    • The side-effect is that there is no single point of entry into the DAMS
      • Beneficial for reasons stated before
      • Moves the problem from continual programmatic work just to 'tread water' to adapting to needs and actual use cases.
      • Big Issue: Needs and Uses become much more important and these are hard to tackle!
  • Commonalities
    • What conventions and standards are common between the services and archives?
      • Pro: The more commonalities we adopt, the more time we can spend addressing the real problems.
      • Con: Too much of a prescribed system can easily lead to paralysis (as we have seen in other big institutional projects outside of Oxford)
  • Commonalities
    • Global Identifiers
      • When source content enters a system on the DAMS it is given an identifier that is understandable outside of the system
      • e.g. Not just an internal table id in a database, but a [table, etc] id that has meaning outside of the database too.
      • URI is a current, widely adopted means of having globally understood identifiers so this is a nice commonality to adopt.
  • Commonalities
    • Consider:
      • You have an auto-incrementing primary key in a database table.
      • You adopt a namespace that is unique to your system (but preferably not the project name)
      • E.g. name:, etc. rather than brii:; letter: rather than cok:
      • It is helpful if the namespace implies the source of the content.
      • Then you would have name:1, name:2 etc
  • Commonalities
    • BUT!
      • There will be an awful lot of systems which deal with similar types of things.
      • Is name:1 from one of Su's systems the same as name:1 from one of Anusha's systems?
      • Or is name:1 in an OUCS system, one that we don't even know about yet, the same?
  • UUIDs (or GUIDs)
    • Fundamentally a very big number
    • So big that, if it is created from a random enough source, you are extremely unlikely to ever create the same one twice.
    • Ever.
    • I really mean it.
    • The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs.
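
The 50% figure can be sanity-checked with the standard birthday-bound approximation. The sketch below assumes randomly generated (version 4) UUIDs, which carry 122 random bits. The slide does not say which UUID version it has in mind, so that is an assumption, and the function names are invented.

```python
import math

RANDOM_BITS = 122          # assumption: version 4 UUIDs, 122 of 128 bits are random
SPACE = 2 ** RANDOM_BITS   # number of distinct random UUIDs


def collision_probability(n):
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / (2 * SPACE))."""
    return -math.expm1(-(n * n) / (2 * SPACE))


def count_for_probability(p):
    """Invert the approximation: n ~= sqrt(2 * SPACE * ln(1 / (1 - p)))."""
    return math.sqrt(2 * SPACE * math.log(1 / (1 - p)))


n_half = count_for_probability(0.5)   # UUIDs needed for a 50% chance of one duplicate
owners = n_half / 600e6               # owners holding 600 million UUIDs each
```

Under these assumptions `n_half` comes out at roughly 2.7 × 10^18 UUIDs, which at 600 million each corresponds to about 4.5 billion owners. The quoted figure is therefore best read as an order-of-magnitude statement about the world's population.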
  • UUIDs (or GUIDs)
    • From Wikipedia:
      • The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. Thus, anyone can create a UUID and use it to identify something with reasonable confidence that the identifier will never be unintentionally used by anyone for anything else. Information labeled with UUIDs can therefore be later combined into a single database without needing to resolve name conflicts.
  • UUIDs (or GUIDs)
    • What's a GUID?
      • Microsoft have issues about adopting anything that already exists without owning it. So they adopted everything about UUID but changed the first letter and called it their own invention.
      • “Globally” rather than “Universally”
      • Everything else about it is the same.
      • Yeah, I think it's pretty pointless too, but it's a recurring theme with MS.
  • UUID vs local ID
    • One of the problems is that some people don't like to use big, long hexadecimal numbers.
    • We call these people 'normal'
    • One of my biggest mistakes with ORA was using this number in visual priority over a local, friendly ID.
    • Whilst this makes the item citable for much longer, it makes it way more unfriendly.
    • -> I didn't do enough URL masking.
  • External Identifiers
    • It's rare to get any content that doesn't have at least one identifier of some kind.
      • from a controlled, centralised system of ids like DOI
      • To a local id like 'letter 275'
    • All of these are important to capture too! They represent the 'labels' that various user groups use to refer to the content.
  • Identifiers - Summary
    • Every 'thing' in your system will have more than one identifier or label. This is a Good Thing.
    • Give every 'thing' in your system identifiers which you, your users and your system will use.
      • A local one is handy ('name:1'), but
      • A global one is essential (uuid:...)
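
One way to picture "a local one is handy, a global one is essential" is a small registry that mints a UUID for each thing and records every other label against it. This is a sketch under stated assumptions: the class and method names are hypothetical, and a real system would persist this mapping rather than hold it in memory.

```python
import uuid


class IdentifierRegistry:
    """Maps each thing's global UUID to all the labels people use for it."""

    def __init__(self):
        self._labels = {}   # global id -> set of labels
        self._lookup = {}   # label -> global id

    def register(self, *labels):
        """Mint a new global identifier and attach the given labels to it."""
        global_id = uuid.uuid4().urn  # a string of the form "urn:uuid:..."
        self._labels[global_id] = set()
        for label in labels:
            self.add_label(global_id, label)
        return global_id

    def add_label(self, global_id, label):
        """Attach a label, refusing to reuse one that names another thing."""
        owner = self._lookup.setdefault(label, global_id)
        if owner != global_id:
            raise ValueError(f"{label!r} already refers to {owner}")
        self._labels[global_id].add(label)

    def resolve(self, identifier):
        """Accept either a global id or any known label."""
        if identifier in self._labels:
            return identifier
        return self._lookup[identifier]

    def labels_for(self, global_id):
        return sorted(self._labels[global_id])
```

For example, `registry.register("name:1", "letter 275")` returns a fresh `urn:uuid:` identifier, and both labels then resolve to it. Friendly labels stay usable while the global identifier does the citing.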
  • I glossed over 'Thing', didn't I?
    • Instead of considering metadata as a set of records, the aim is to have metadata that model or represent that which we are archiving.
      • If we have a book, X, written solely by an author Y, then we should have an author object Y linked as the author to a book object X.
      • [Things get more complicated when we start considering versions, dates, and Work-Manifestations, but more on that later]
    • The archives should be object-orientated.
  • Global Ids? Why?
    • A global ID provides a 'node', a hook on which we can hang other information and from and to which we can make links.
      • [Typed links – links which have more meaning than 'A is linked to B' – links more like 'A is authored by B and C']
    • A common label for the way this is modelling information is 'Linked Data'
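
Typed links are often written as subject, predicate, object triples. The sketch below models the book-and-author example that way, using plain tuples rather than an RDF library. The two `urn:uuid:` identifiers and the literal values are invented. The predicate URIs are the Dublin Core `creator` and `title` terms and the FOAF `name` term, chosen here as an assumption since the slides name no vocabulary.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate object")

CREATOR = "http://purl.org/dc/terms/creator"
TITLE = "http://purl.org/dc/terms/title"
NAME = "http://xmlns.com/foaf/0.1/name"

# Invented identifiers standing in for book X and author Y.
BOOK_X = "urn:uuid:11111111-1111-4111-8111-111111111111"
AUTHOR_Y = "urn:uuid:22222222-2222-4222-8222-222222222222"

GRAPH = [
    Triple(BOOK_X, CREATOR, AUTHOR_Y),          # typed link between two things
    Triple(BOOK_X, TITLE, "An Example Book"),   # information hung on the book node
    Triple(AUTHOR_Y, NAME, "A. N. Author"),     # information hung on the author node
]


def objects(graph, subject, predicate):
    """Follow a typed link forwards: what does this subject point to?"""
    return [t.object for t in graph
            if t.subject == subject and t.predicate == predicate]


def subjects(graph, predicate, obj):
    """Follow a typed link backwards: what points to this object?"""
    return [t.subject for t in graph
            if t.predicate == predicate and t.object == obj]
```

`objects(GRAPH, BOOK_X, CREATOR)` finds the author of the book, and `subjects(GRAPH, CREATOR, AUTHOR_Y)` finds everything that author wrote. The global identifiers are what make both directions possible.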
  • Guidelines for Linked Data
    • Use URIs as names for things
    • Use HTTP URIs so that people can look up those names.
    • When someone looks up a URI, provide useful information [using the standards (RDF, SPARQL)]
    • Include links to other URIs, so that they can discover more things.
      • By Tim Berners-Lee (brackets are mine tho')
      • http://www.w3.org/DesignIssues/LinkedData.html
  • RDF and SPARQL
    • This is the point that I don't quite agree with as strongly as Sir Tim. Let's break it down:
    • URIs must supply widely understood metadata formats
      • Agreed
    • Services that work in an understood manner
      • Also agreed
    • URIs must supply RDF?
      • Amongst other standards, yes, agreed.
    • Services must supply SPARQL?
      • Hmmm..... I don't agree. Services should be useful and wanted – not necessarily SPARQL-based.
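
"RDF amongst other standards" can be sketched as one description offered in more than one format, chosen by the media type the client asks for. The example below is deliberately simplified: it does a first-match scan of an `Accept`-style string and ignores quality values, which real HTTP content negotiation would honour. The URI, the description and the function names are hypothetical.

```python
import json

# Invented description of one thing, keyed by predicate URI.
DESCRIPTION = {
    "uri": "http://example.org/id/name/1",
    "properties": {
        "http://purl.org/dc/terms/title": "An Example Item",
    },
}


def escape_literal(text):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(desc):
    lines = [
        f'<{desc["uri"]}> <{predicate}> "{escape_literal(value)}" .'
        for predicate, value in sorted(desc["properties"].items())
    ]
    return "\n".join(lines) + "\n"


def to_json(desc):
    return json.dumps(desc, sort_keys=True)


REPRESENTATIONS = {
    "application/n-triples": to_ntriples,
    "application/json": to_json,
}


def represent(desc, accept):
    """Return (media_type, body) for the first requested type we support."""
    for entry in accept.split(","):
        media_type = entry.split(";")[0].strip()
        if media_type in REPRESENTATIONS:
            return media_type, REPRESENTATIONS[media_type](desc)
    return "application/json", to_json(desc)  # fallback when nothing matches
```

Adding another format means adding one function and one dictionary entry, so offering RDF does not rule out also offering whatever else users find useful.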
  • Linked Data... Semantic Web...?
    • They are slightly different, but you'll never get a straight answer on the differences.
    • By and large, those that talk about creating Linked Data are more pragmatic than those talking about linking in with the Semantic Web.
    • But the opposite might be true in some cases ;)
  • Playing well with others
    • The key is to think about your content in terms of it being discovered in your absence.
      • How might they understand the terms?
      • Meanings of nodes and links? [“Nouns” and “verbs”]
    • RDF provides ways to share this information. Model your terms with a view that this information will be shared.... LINKED DATA!
  • Curation - Microservices, i.e. what do we do with the content we have or the content we will get?
    • Digital Curation
      Activities focused on maintaining and adding value to trusted digital content
      Encompasses preservation and access, which are complementary, not disparate functions
      • Preservation ensures access over time
      • Access depends on preservation up to a point in time
      How can we make the “Save” button really mean “save”?