• What digital “stuff” do you have? Where
do you put it? How do you organize it, if
you do? How do you find it when you
Tool of the week: Self-efficacy
• In the course of your career, you will have to
do things you don’t entirely know how to do.
• Technical and non-!
• Without training, guidance, or clear instructions.
• No, of course we don’t teach you everything in library
• Learn to dive in despite imperfect knowledge.
• Use your common sense.
• Trust that those around you want you to succeed.
• If you need to, research! Always be ready to learn.
• Mentors are great... but they’re not babysitters.
• Accept imperfection.
• Please model these behaviors in my class!
Tip of the week: Staying informed
• Weblogs and newsfeeds are your friends.
• If you are not reading at least a few librarian blogs,
you are not staying informed.
• Can’t hurt to pick up some journal TOCs too.
• Blogs are faster than the published literature! And
often written by the same people.
• For (library) tech:
• Librarian in Black
• Planet Code4Lib
• Lifehacker, Gizmodo, Engadget
• Roy Tennant’s LJ columns
What is metadata?
• Heck, I dunno. I’m not sure that’s even a
• This is one reason I’m not a library-school
professor. Definitional pilpul bores me.
• Operationally: when we collect stuff, we
take notes on it so we can organize it,
inventory it, find it later, etc. Those
notes are metadata.
• Is MARC metadata? Well, of course!
• But many librarians don’t think about it that way.
Why are there so many
• Different things described
• For an image, you want to know its bit depth and
colorspace. This has no meaning for a finding aid.
• Several targeted standards vastly easier to cope
with than one supposedly universal standard.
• Different purposes
• More on this in a moment
• Different provider and user communities
• Level of detail/specificity
• Wheel (or toothbrush) reinvention
Metadata file formats
• You can express metadata in an Excel
spreadsheet, a MARC record, XML, RDF...
• But some expressions are more readable, useful,
and reusable than others!
• Metadata librarians spend a lot of time fixing and
transforming Other People’s Metadata, in as
automated a fashion as possible.
• Large majority of modern metadata
standards expressed in XML.
• Though RDF wants to be a contender, and XML is
only one way of several to express RDF.
So what’s this RDF thing all the
cool kids are talking about?
• Resource Description Framework
• by the W3C
• Like XML, RDF is more or less friendly to
whatever kind of metadata you want to
throw at it.
• Unlike XML, RDF is a data model designed for integrating
information from different metadata vocabularies, and
expressing how items and metadata records relate to one
another. Links and linking!
• (Also, XML works for content, e.g. TEI. RDF doesn’t.)
(very) Basic RDF
• “Triple:” subject, property, value
• A little like subject, verb, object in English.
• Dorothea Salo is the author of “Innkeeper
at the Roach Motel.”
• Subject: either me or the article (works either way,
depending on property chosen)
• Property: authorship (“isAuthorOf” or “isBy”); often
comes from a controlled vocabulary like Dublin Core
• Value: either the article or me, depending
• One annoying thing: URIs as identifiers
• What is my URI? Or the article’s (several versions)?
• Several other annoying things about RDF, but they’re
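To make the triple idea concrete, here is a minimal sketch in plain Python (no RDF library). The two example.org URIs and the isAuthorOf property are made up for illustration; only the Dublin Core "creator" URI is a real vocabulary term. It shows the same authorship fact stated in both directions, depending on which property you pick.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    prop: str
    value: str


# Hypothetical identifiers, invented for this sketch.
AUTHOR = "http://example.org/people/dorothea-salo"
ARTICLE = "http://example.org/articles/innkeeper-at-the-roach-motel"
IS_AUTHOR_OF = "http://example.org/vocab/isAuthorOf"
# Real Dublin Core term for "who made this".
DC_CREATOR = "http://purl.org/dc/terms/creator"

# The same fact, stated from either end.
triples = [
    Triple(ARTICLE, DC_CREATOR, AUTHOR),    # article "isBy" person
    Triple(AUTHOR, IS_AUTHOR_OF, ARTICLE),  # person "isAuthorOf" article
]


def values_for(subject, prop, store):
    """Return every value recorded for this subject and property."""
    return [t.value for t in store if t.subject == subject and t.prop == prop]


def to_ntriples(triple):
    """Write one triple as an N-Triples line.

    This sketch assumes all three parts are URIs; literal values
    (plain strings, dates) would need quoting instead of angle brackets.
    """
    return f"<{triple.subject}> <{triple.prop}> <{triple.value}> ."
```

Calling values_for(ARTICLE, DC_CREATOR, triples) answers "who made this article?" by matching on subject and property, which is all a basic triple lookup does.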
• As the web linked documents and people,
it’s now time (say some) to link data.
• Not a simple proposition!
• RDF is hard. Calling it linked data doesn’t make it easier.
• Data modeling is hard.
• Data integration is hard. RDF makes it easier... up to a
point. Still HUGE problems around people using the
same term differently, other unexamined assumptions.
• Idea gaining traction among governments,
other big data providers.
• So we probably need to keep our eye on it.
• ALWAYS a good idea to think about how
other people might use your metadata.
Kinds of metadata
• Descriptive (“bibliographic”)
• Who made this? When? Where? What’s it about? Etc.
• Technical
• What is this? What is its format? What made it? Etc.
• Administrative
• Who owns this? Who’s changed it? Who has what IP
rights over it? Who can see it? Etc.
• Structural
• How is this thing put together?
• In practice, the landscape is muddier.
• Most standards have bits of two or more types.
• Also, “relationship” metadata coming to the fore.
• Metadata Object Description Schema
• Maintained by Library of Congress
• Stripped-down, human-readable MARC
• Sample: http://www.loc.gov/standards/mods/v3/
Technical metadata: MIX
• Metadata for Images in XML
• By Library of Congress, NISO
• Captures information about an image’s file
format and other technical characteristics
• Why? Think about file-format
• Sample document: http://www.loc.gov/standards/mix/
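As a rough illustration of where this kind of technical metadata comes from, the sketch below reads width, height, bit depth, and color type straight out of a PNG file's header using only the standard library. The dictionary keys are my own labels, not MIX element names, and a real workflow would more likely lean on an existing characterization tool than hand-rolled parsing.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Color type codes defined by the PNG specification.
COLOR_TYPES = {
    0: "grayscale",
    2: "truecolor (RGB)",
    3: "indexed color",
    4: "grayscale with alpha",
    6: "truecolor with alpha (RGBA)",
}


def png_technical_metadata(data: bytes) -> dict:
    """Pull basic technical metadata from the first 26 bytes of a PNG."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # The first chunk must be IHDR: 4-byte length, then 4-byte chunk type.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("missing or malformed IHDR chunk")
    # IHDR starts with width, height (4 bytes each), bit depth, color type.
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "format": "PNG",
        "width": width,
        "height": height,
        "bit_depth": bit_depth,  # bits per sample, not per pixel
        "color_type": COLOR_TYPES.get(color_type, "unknown"),
    }
```

To try it on a real file, read the bytes with open(path, "rb").read(26) and pass them in. Nobody typed any of this metadata: it was sitting inside the content object all along.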
• PREservation Metadata: Implementation Strategies (PREMIS)
• who comes up with these acronyms?
• Library of Congress, again
• Designed to track digital preservation
activity across an object’s lifecycle
• Samples: look in http://www.dlib.org/dlib/
• But be aware that PREMIS is usually embedded in
other metadata, like METS.
• Metadata Encoding and Transmission Standard (METS)
• By... guess who?
• Wrapper for other kinds of metadata;
delineates the structure of a complex
• Samples: http://www.loc.gov/standards/mets/
Metadata spaghetti: TEI
• Text Encoding Initiative
• by the TEI Consortium
• For digital transcriptions of books,
manuscripts, dictionaries, etc. etc.
• Content standard, not metadata standard!
But contains its own “metadata header”
• This header sometimes reused in other contexts
• Moral: Sometimes content “embeds”
• This is OK, but should every content standard roll its
own internal metadata?
Where does metadata come from?
• Human data entry
• Slow, expensive, error-prone
• Often semi-automatable (80/20 point)
• If you can automate, DO IT. Do not waste keystrokes!
• Auto-extracting from a content object
• Common for technical metadata
• Auto-capture by preservation system
• Common for some administrative metadata
• Grabbing from elsewhere
• From other metadata: “crosswalking”
• HTML screenscraping, Excel spreadsheets
• Issues: authority control? granularity? accuracy?
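Here is a small sketch of what a crosswalk can look like in practice: spreadsheet-style rows go in, simple Dublin Core XML comes out. The column names, the mapping table, the "record" wrapper element, and the sample row are all invented for illustration. The semicolon splitting shows the granularity issue in miniature: one spreadsheet cell may hold several values that the target standard wants as separate elements.

```python
import csv
import io
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical crosswalk: spreadsheet column -> Dublin Core element.
CROSSWALK = {
    "Title": "title",
    "Author": "creator",
    "Year": "date",
    "Keywords": "subject",
}

# Columns assumed to pack several values into one cell, separated by ";".
MULTI_VALUED = {"Keywords"}


def row_to_dc(row):
    """Convert one spreadsheet row (a dict) into a simple DC record."""
    record = ET.Element("record")  # wrapper element name is an assumption
    for column, dc_element in CROSSWALK.items():
        raw = (row.get(column) or "").strip()
        if not raw:
            continue  # skip empty cells rather than emit empty elements
        values = raw.split(";") if column in MULTI_VALUED else [raw]
        for value in values:
            value = value.strip()
            if value:
                ET.SubElement(record, f"{{{DC_NS}}}{dc_element}").text = value
    return record


# Made-up sample data standing in for an exported spreadsheet.
SAMPLE_CSV = (
    "Title,Author,Year,Keywords\n"
    'An Example Report,"Doe, Jane",2009,metadata; crosswalks\n'
)

records = [row_to_dc(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```

ET.tostring(records[0], encoding="unicode") gives the XML as a string. Notice what the script does not do: it cannot tell whether "Doe, Jane" matches an authority file or whether the year is right. Those issues stay with you.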
• What is this thing about?
• Plenty of variation in sources
• Author’s keyword vs. indexer’s descriptor
• Controlled vocabulary vs. free-form keywording
• Community tagging/“folksonomy”
• Mechanically-extracted keywords
• All of this matters if you’re searching!
Where does metadata live?
• In XML files (or MARC files, or...)
• In relational databases
• In RDF “triple stores” (special databases)
• In content objects (as with TEI)
• Or some combination of the above!
• E.g. DSpace: can accept metadata in an XML file; stores
all metadata in relational database
• Next trick: associating content with its
What is done with metadata?
• To search against it or use it to browse,
you need to “index” it first.
• Turn it inside-out: records containing terms --> list
of terms and the records they appear in
• It’s all more complicated: stemming, phrases,
variant spellings, languages, stopwords, etc.
• The hot new indexing software is Apache “Solr.” It
underlies Blacklight (from UVa), which underlies Forward.
• Full-text search works the same way!
• Google’s index: MASSIVE database of words with
the web pages they appear in.
• Spider/crawler: program that follows links across
the web and indexes page content
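The “turn it inside-out” step can be sketched in a few lines of Python. This is a toy, not how Solr or Google actually do it: the stopword list is a tiny made-up sample, there is no stemming or phrase handling, and a multi-word query simply requires every word to be present.

```python
import re
from collections import defaultdict

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "at", "in", "of", "the"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def build_index(records):
    """Invert {record id: text} into {term: set of record ids}."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for term in tokenize(text):
            index[term].add(record_id)
    return index


def search(index, query):
    """Return the ids of records containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Given records like {1: "Innkeeper at the Roach Motel", 2: "The Roach Problem"}, searching for "roach" finds both and "roach motel" finds only the first. The index answers the question without rereading any record, which is the whole point of building it ahead of time.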
• You have a bunch of words and the records
or documents they appear in. How do you
decide which records/pages to display ﬁrst?
• Traditionally in libraries: last-in-first-out. Awful.
• Using document structure and metadata
• If the word’s in a title, heading, or subject field, take it
more seriously than if it’s just in ordinary text.
• Term frequency: how often the search term shows up in
a given record/document
• Inverse document frequency: how rare the search term
is in the whole mass of records/documents.
[Table: rare term → record is probably “about” this; common term → common word or irrelevant]
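Term frequency and inverse document frequency are usually multiplied together into one score. The sketch below uses one common textbook weighting (term count divided by document length, times the log of total documents over documents containing the term). Real search engines use many variants, so treat the exact formula as an assumption.

```python
import math
from collections import Counter


def tf_idf_scores(term, documents):
    """Score each document for one search term.

    documents maps a document id to its list of (already tokenized) words.
    Returns {document id: score} for documents that contain the term.
    """
    containing = sum(1 for words in documents.values() if term in words)
    if containing == 0:
        return {}
    # Rare terms get a high IDF; a term found in every document gets 0.
    idf = math.log(len(documents) / containing)
    scores = {}
    for doc_id, words in documents.items():
        if not words:
            continue  # avoid dividing by zero on empty documents
        tf = Counter(words)[term] / len(words)
        if tf > 0:
            scores[doc_id] = tf * idf
    return scores
```

Sorting the returned dictionary by score, highest first, gives a crude relevance ranking. A word that appears in every record ends up with a score of zero everywhere, which matches the intuition that common terms tell you nothing about what a record is “about.”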
What other information can
be used to gauge relevance?
• People pointing
• Google: PageRank, based on counting links to a
• Scholarly communication: many metrics based on
later citation of articles
• People choosing
• Google also up-votes pages based on people
clicking on them in search results.
• Individual or social history of interests
• Amazon, Netflix
• Notice who’s doing this and who isn’t.
• Serious question: what about privacy?
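For the “people pointing” idea, here is a simplified PageRank-style calculation. It is the textbook power-iteration version with a damping factor, not Google's production algorithm, and the default values for damping and iteration count are conventional assumptions. The gist: a page's score is shared out among the pages it links to, so a link from a well-linked page counts for more.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate link-based importance for each page.

    links maps each page to the list of pages it links to.
    Returns {page: score}; the scores sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score among the pages it points to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

With links = {"a": ["c"], "b": ["c"], "c": ["a"]}, page c comes out on top because two pages point to it, a comes second because the top page points to it, and b trails because nobody points to it. Citation-based metrics in scholarly communication run on the same kind of reasoning.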
• Making sure that your page turns up in
searches for relevant terms.
• Done maliciously, this amounts to spam. Google
spends LOTS of effort despamming its index.
• Clean markup helps. So does putting
highly relevant terms in highly visible/
• Also, don’t overload pages! Dilutes vocabulary.
What else can you do with
• Point people to PEOPLE and SERVICES,
not just search results!
• Point people to context that will help
them evaluate search results.
• We know people just throw search terms at boxes.
We might as well work with that.
• This may well be the best work Forward
A word about GIS
• “Geographic Information Systems”
• It’s metadata all the way down! Metadata
• Also a lot about how to represent and visualize that
• And how to mash it up with other data.
• Heavily based on relational-database
• HOT JOB MARKET. If you can get trained, do.
Finding and using
• Nobody knows every metadata
standard out there. I sure don’t.
• But faced with a new standard, I may
have to get up to speed fast.
• I may even be making adoption decisions.
• So here’s how I do it.
Getting up to speed
• Find its website. If it doesn’t have a
website, you don’t want to use it.
• Is the website current? Is there recent activity?
• Is there a list of who’s using this standard?
• Find a sample record.
• How is this standard expressed? XML, RDF, what?
• Does it pass a sniff test?
• Find the documentation and community.
• “Tag libraries” and “data dictionaries” especially helpful.
• Primers, “getting started” documents also nice.
• Look for tools.
• Authoring/crosswalk tools (and programming libraries)
• Validation tools
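When the sample record is XML, part of the sniff test can be automated. The sketch below checks that a record is well-formed and lists which elements and namespaces it uses. Note that this is NOT validation against the standard's schema, which needs a dedicated validation tool. The sample is a hand-written MODS-like fragment, not one of the Library of Congress samples.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def sniff_record(xml_text):
    """Report well-formedness plus the elements and namespaces used."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return {"well_formed": False, "error": str(err)}

    def local_name(tag):
        # ElementTree writes namespaced tags as "{namespace}name".
        return tag.rsplit("}", 1)[-1]

    elements = list(root.iter())
    namespaces = sorted(
        {el.tag[1:].split("}")[0] for el in elements if el.tag.startswith("{")}
    )
    return {
        "well_formed": True,
        "root": local_name(root.tag),
        "namespaces": namespaces,
        "elements": dict(Counter(local_name(el.tag) for el in elements)),
    }


# Hand-written fragment for illustration only.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>An Example Title</title></titleInfo>
  <name><namePart>Doe, Jane</namePart></name>
</mods>"""

report = sniff_record(SAMPLE)
```

The element list in the report makes a quick checklist to hold against the standard's tag library or data dictionary: any element you cannot find documented there is a red flag.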