Digital Library Infrastructure for a Million Books
Describes what library infrastructure is needed for digital humanities use of mass digitized collections. Given at the Million Books Workshop, May 2007.
Slideshow Transcript
- Slide 1: Digital Library Infrastructure
for a Million Books:
Synthesis (and Opinion)
Steve Toub
California Digital Library
- Slide 2: Having books talk to each
other in the background
• Ability to enable books to talk to each other
– Too costly to pre-define all relationships a priori
– Need a system that grows, learns dynamically
• Requires:
– Rich markup of high-value texts
– Ability to expose structure (e.g., microformats)
OR address latent structure
• Including well-managed identifiers within a text
– Ability to define relationships between hooks in
one text and one in another
• If not RDF/OWL, something conceptually similar
– Tools, APIs (standards, service registries, later)
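The idea of defining relationships between a hook in one text and a hook in another can be sketched as a small store of subject/predicate/object statements, which is conceptually similar to RDF without being RDF. This is a minimal illustration, not a proposed design: the `Link` and `LinkStore` names, the `urn:example:` identifiers, and the predicate names are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Link(NamedTuple):
    """One typed relationship between two hooks (subject, predicate, object)."""
    subject: str    # identifier of a hook in one text
    predicate: str  # kind of relationship, e.g. "quotes"
    obj: str        # identifier of a hook in another text


class LinkStore:
    """In-memory store that grows as new relationships are asserted."""

    def __init__(self) -> None:
        self._outgoing = defaultdict(list)  # hook id -> links starting there
        self._incoming = defaultdict(list)  # hook id -> links ending there

    def add(self, subject: str, predicate: str, obj: str) -> Link:
        link = Link(subject, predicate, obj)
        self._outgoing[subject].append(link)
        self._incoming[obj].append(link)
        return link

    def links_from(self, hook: str) -> list:
        return list(self._outgoing.get(hook, []))

    def links_to(self, hook: str) -> list:
        return list(self._incoming.get(hook, []))


# Hypothetical identifiers: book id plus an anchor inside the text.
store = LinkStore()
store.add("urn:example:book-a#ch3.p12", "quotes", "urn:example:book-b#ch1.p4")
store.add("urn:example:book-c#ch7.p2", "cites", "urn:example:book-b#ch1.p4")

# Which passages elsewhere point at this paragraph of book B?
for link in store.links_to("urn:example:book-b#ch1.p4"):
    print(link.subject, link.predicate, link.obj)
```

Nothing is pre-defined here: relationships are added one at a time as they are discovered, which only works if the identifiers inside each text stay stable.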
- Slide 3: “How many libraries do we need?”
• We must operate under the assumption that
everything will live at the network level
• Dividing the heterogeneous million book
corpus into sub-corpora, by domain, is
necessary to apply tools effectively
• Unless things change radically, humanities
scholars, libraries, and the government don’t
have the resources or will to shape how
digital library services will exist in an open,
interoperable way at the network level
- Slide 4: “Explore effectiveness of
discovery tools”
• Listed only as a single bullet point on
one of Bruce Robertson’s slides but
something near and dear to my heart
• Example:
– George Bancroft's History of the United States: Vol.
1, History of the Colonization of the United States
- Slide 5: Bancroft’s History of the US:
Perseus
- Slide 6: Bancroft’s History of the US:
MoA
- Slide 7: Bancroft’s History of the US:
MoA 2
- Slide 8: Bancroft’s History of the US:
Google
- Slide 9: Bancroft’s History of the US:
Google 2
- Slide 10: Bancroft’s History of the US:
Google 3
- Slide 11: Bancroft’s History of the US:
Google 4
- Slide 12: Bancroft’s History of the US:
Google 5
- Slide 13: Bancroft’s History of the US:
IA
- Slide 14: Bancroft’s History of the US:
IA 2
- Slide 15: Bancroft’s History of the US:
IA 3
- Slide 16: Bancroft’s History of the US:
IA 4
- Slide 17: Bancroft’s History of the US:
Gale
- Slide 18: Bancroft’s History of the US:
WorldCat Editions
- Slide 19: Bancroft’s History of the US:
WorldCat Results
- Slide 20: Discovery infrastructure
at the network level
• Evangelize metadata/content exposure APIs
– OAI-PMH, COinS, hCite, RDFa, unAPI, OAI-ORE
• Ability to aggregate raw data and scrubbed data at the
network level (plus scrubbing tools and loaders)
• An authoritative content registry
(i.e., OCLC Registry of Digital Works)
– Work-level (and expression-level) identifiers
– Low-barrier workflows for update + expose/harvest
• Formalized ways to express and expose content
relationships: xISBN + thingISBN + ML approaches
– Related objects (versions, editions, translations, dupes, …)
– Compound objects (chapters, overlay journals, …)
– References (annotations, quotations, excerpts, …)
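As one concrete picture of a metadata exposure API from the list above, the sketch below builds OAI-PMH harvest request URLs. The verb and argument names come from the OAI-PMH 2.0 protocol; the `oai_request_url` helper and the `repository.example.org` base URL are assumptions made for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def oai_request_url(base_url: str, verb: str, **args: str) -> str:
    """Build an OAI-PMH GET request URL (hypothetical helper).

    'from' is a Python keyword, so pass it as 'from_'.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    if "resumptionToken" in args:
        # A resumption token is an exclusive argument: send it alone.
        params["resumptionToken"] = args["resumptionToken"]
    else:
        for name, value in args.items():
            if value is not None:
                params[name.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


# Hypothetical repository endpoint.
BASE = "https://repository.example.org/oai"

first_page = oai_request_url(
    BASE, "ListRecords", metadataPrefix="oai_dc", from_="2007-05-01"
)
next_page = oai_request_url(BASE, "ListRecords", resumptionToken="abc123")
print(first_page)
print(next_page)
```

A harvester would fetch the first URL, parse the XML response, and keep requesting with each returned resumption token until none is returned.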
- Slide 21: Need more focus on usage,
research process as a whole
• Tools like Piggy Bank, Zotero can capture
• Count transactions (views, downloads,
annotations, citations…) in standardized ways
• Standardize log data (e.g., COUNTER) and
its exchange (e.g., SUSHI)
• Consolidate transaction data to close
feedback loops and improve services
– Metrics (e.g., MESUR, search analytics)
– Ranking, recommending, social discovery
– Matching user terms to: authority files; texts
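Counting transactions in a standardized way can be pictured as reducing raw events to per-item, per-month, per-type totals that different services could exchange and compare. The sketch below is only that reduction step under assumed inputs: the event tuples and item identifiers are made up, and the output is not the COUNTER report format.

```python
from collections import Counter
from datetime import date


def monthly_counts(events):
    """Tally events by (item id, 'YYYY-MM', event type)."""
    counts = Counter()
    for item_id, event_type, day in events:
        counts[(item_id, day.strftime("%Y-%m"), event_type)] += 1
    return counts


# Hypothetical log events: (item id, event type, date).
events = [
    ("book-1", "view", date(2007, 5, 2)),
    ("book-1", "view", date(2007, 5, 20)),
    ("book-1", "download", date(2007, 5, 20)),
    ("book-2", "view", date(2007, 6, 1)),
]

counts = monthly_counts(events)
for key in sorted(counts):
    print(key, counts[key])
```

Once several services report the same keys, their totals can be consolidated to feed metrics, ranking, and recommendation.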
- Slide 22: Infrastructure for identity/users
• OpenID is only a hint of what's to come
• Identity is the foundation for higher-level
services: trust, authority, verified claims,
reputation, groups/memberships, …
• How can this emerging ecosystem leverage
existing infrastructure?
– Authority files
– LDAP servers
– Peer-review, citations, etc.
- Slide 23: Can’t take these for granted
• Preservation
– Anyone who has looked at a credible cost breakdown
for a well-managed long-term digital preservation
repository will understand that “the libraries will
always preserve” is NOT a given
• Intellectual property rights
– GBS not showing many public domain items
– Keeping tabs on orphan works
- Slide 24: Sustainable business models
• Crucial to think about economic sustainability
and shape realistic economic incentives
• Why don’t libraries license JSTOR indefinitely
so it could be opened up free for all?
• How to incentivize disciplinary collaboration
that goes across institutions?
• What’s our relationship to players at network
level: Google, MSFT, IA, OCLC, Amazon, …
• Who pays for the common infrastructure?
• Organizational infrastructure to:
– Enable free riders? Feed and care for free kittens?
- Slide 25: Where to begin:
Any consensus?
• Keep increasing available texts?
• Compelling applications?
• Easy to use tools?
• APIs, service registries?
• Key software engineering tasks?
• Marketing and outreach?