Published in: Education


  1. Metadata (25 October 2010)
  2. Weekly reflection
     • What digital “stuff” do you have? Where do you put it? How do you organize it, if you do? How do you find it when you need it?
  3. Tool of the week: Self-efficacy
     • In the course of your career, you will have to do things you don’t entirely know how to do.
     • Technical and non-technical!
     • Without training, guidance, or clear instructions.
     • No, of course we don’t teach you everything in library school!
     • Learn to dive in despite imperfect knowledge.
     • Use your common sense.
     • Trust that those around you want you to succeed.
     • If you need to, research! Always be ready to learn.
     • Mentors are great... but they’re not babysitters.
     • Accept imperfection.
     • Please model these behaviors in my class!
  4. Tip of the week: Staying informed
     • Weblogs and newsfeeds are your friends.
     • If you are not reading at least a few librarian blogs, you are not staying informed.
     • Can’t hurt to pick up some journal TOCs too.
     • Blogs are faster than the published literature! And often written by the same people.
     • For (library) tech:
       • Librarian in Black
       • Planet Code4Lib
       • Lifehacker, Gizmodo, Engadget
       • Roy Tennant’s LJ columns
  5. What is metadata?
     • Heck, I dunno. I’m not sure that’s even a useful question.
     • This is one reason I’m not a library-school professor. Definitional pilpul bores me.
     • Operationally: when we collect stuff, we take notes on it so we can organize it, inventory it, find it later, etc. Those notes are metadata.
     • Is MARC metadata? Well, of course!
     • But many librarians don’t think about it that way.
  6. Why are there so many metadata standards?
     • Different things described
       • For an image, you want to know its bit depth and colorspace. This has no meaning for a finding aid.
       • Several targeted standards are vastly easier to cope with than one supposedly universal standard.
     • Different purposes
       • More on this in a moment
     • Different provider and user communities
     • Level of detail/specificity
     • Wheel (or toothbrush) reinvention
  7. Metadata file formats
     • You can express metadata in an Excel spreadsheet, a MARC record, XML, RDF...
     • But some expressions are more readable, useful, and reusable than others!
     • Metadata librarians spend a lot of time fixing and transforming Other People’s Metadata, in as automated a fashion as possible.
     • The large majority of modern metadata standards are expressed in XML.
     • Though RDF wants to be a contender, and XML is only one way of several to express RDF.
  8. So what’s this RDF thing all the cool kids are talking about?
     • Resource Description Framework
     • By the W3C
     • Like XML, RDF is more or less friendly to whatever kind of metadata you want to throw at it.
     • Unlike XML, RDF is a data model designed for integrating information from different metadata vocabularies, and expressing how items and metadata records relate to one another. Links and linking!
     • (Also, XML works for content, e.g. TEI. RDF doesn’t.)
  9. (very) Basic RDF
     • “Triple”: subject, property, value
       • A little like subject, verb, object in English.
     • Dorothea Salo is the author of “Innkeeper at the Roach Motel.”
       • Subject: either me or the article (works either way, depending on property chosen)
       • Property: authorship (“isAuthorOf” or “isBy”); often comes from a controlled vocabulary like Dublin Core
       • Value: either the article or me, depending
     • One annoying thing: URIs as identifiers
       • What is my URI? Or the article’s (several versions)?
     • Several other annoying things about RDF, but they’re super-nerdy.
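The triple idea on the Basic RDF slide can be sketched in a few lines of Python. This is only an illustration of the data model, not an RDF library: the `example.org` URIs are invented placeholders (the slide itself points out that settling on real URIs is the annoying part), and the `Triple` and `values_for` names are hypothetical. The two Dublin Core property URIs are real.

```python
from collections import namedtuple

# One statement: subject, property, value.
Triple = namedtuple("Triple", ["subject", "property", "value"])

# Hypothetical identifiers, invented for this sketch only.
AUTHOR_URI = "http://example.org/people/dorothea-salo"
ARTICLE_URI = "http://example.org/articles/innkeeper-at-the-roach-motel"

# Property URIs from the Dublin Core Terms vocabulary.
DC_CREATOR = "http://purl.org/dc/terms/creator"
DC_TITLE = "http://purl.org/dc/terms/title"

# The article is the subject here; choosing an "isAuthorOf"-style
# property would make the person the subject instead.
triples = [
    Triple(ARTICLE_URI, DC_CREATOR, AUTHOR_URI),
    Triple(ARTICLE_URI, DC_TITLE, "Innkeeper at the Roach Motel"),
]


def values_for(subject, prop, store):
    """Return every value recorded for a given subject and property."""
    return [t.value for t in store if t.subject == subject and t.property == prop]


creators = values_for(ARTICLE_URI, DC_CREATOR, triples)
```

Because every statement has the same three-part shape, triples from different vocabularies can sit in one list (or one triple store) and be queried the same way.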
  10. Linked data
      • As the web linked documents and people, it’s now time (say some) to link data.
      • Not a simple proposition!
        • RDF is hard. Calling it linked data doesn’t make it easier.
        • Data modeling is hard.
        • Data integration is hard. RDF makes it easier... up to a point. Still HUGE problems around people using the same term differently, other unexamined assumptions.
      • Idea gaining traction among governments, other big data providers.
        • So we probably need to keep our eye on it.
      • ALWAYS a good idea to think about how other people might use your metadata.
  11. Kinds of metadata
      • Descriptive (“bibliographic”)
        • Who made this? When? Where? What’s it about? Etc.
      • Technical
        • What is this? What is its format? What made it? Etc.
      • Administrative
        • Who owns this? Who’s changed it? Who has what IP rights over it? Who can see it? Etc.
      • Structural
        • How is this thing put together?
      • In practice, the landscape is muddier.
        • Most standards have bits of two or more types.
        • Also, “relationship” metadata coming to the fore.
  12. Descriptive metadata: MODS
      • Metadata Object Description Schema
      • Maintained by Library of Congress
      • Stripped-down, human-readable MARC in XML
      • Sample: mods99042030.xml
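To make “human-readable MARC in XML” concrete, here is a minimal sketch that reads a title and a name out of a MODS record with Python’s standard library. The record is a tiny hand-written example, not the Library of Congress sample named on the slide, and `describe` is a hypothetical helper name. The namespace URI and the `titleInfo`, `title`, `name`, and `namePart` elements are part of MODS version 3.

```python
import xml.etree.ElementTree as ET

# MODS version 3 namespace.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hand-written illustrative record (an assumption, not an official sample).
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Innkeeper at the Roach Motel</title></titleInfo>
  <name type="personal"><namePart>Salo, Dorothea</namePart></name>
</mods>"""


def describe(xml_text):
    """Pull the title and all name parts out of a single MODS record."""
    root = ET.fromstring(xml_text)
    title = root.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS)
    names = [n.text for n in root.findall("mods:name/mods:namePart", MODS_NS)]
    return {"title": title, "names": names}


summary = describe(SAMPLE)
```

Real MODS records carry far more than this (dates, subjects, identifiers), but the element names stay readable in the same way.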
  13. Technical metadata: MIX
      • Metadata for Images in XML
      • By Library of Congress, NISO
      • Captures information about an image’s file format and other technical characteristics
        • Why? Think about file-format obsolescence.
      • Sample document: instances/test_mix10.xml
  14. Administrative metadata: PREMIS
      • PREservation Metadata: Implementation Strategies (kept up by the PREMIS Maintenance Activity)
        • Who comes up with these acronyms?
      • Library of Congress, again
      • Designed to track digital preservation activity across an object’s lifecycle
      • Samples: look in september08/dappert/09dappert.html
      • But be aware that PREMIS is usually embedded in other metadata, like METS.
  15. Structural metadata: METS
      • Metadata Encoding and Transmission Standard
      • By... guess who?
      • Wrapper for other kinds of metadata; delineates the structure of a complex digital object
      • Samples: mets-examples.html
  16. Metadata spaghetti: TEI
      • Text Encoding Initiative
      • By the TEI Consortium
      • For digital transcriptions of books, manuscripts, dictionaries, etc.
      • Content standard, not metadata standard! But contains its own “metadata header”
        • This header is sometimes reused in other contexts
      • Moral: Sometimes content “embeds” metadata.
        • This is OK, but should every content standard roll its own internal metadata?
  17. Where does metadata come from?
      • Human data entry
        • Slow, expensive, error-prone
        • Often semi-automatable (80/20 point)
        • If you can automate, DO IT. Do not waste keystrokes!
      • Auto-extracting from a content object
        • Common for technical metadata
      • Auto-capture by preservation system
        • Common for some administrative metadata
      • Grabbing from elsewhere
        • From other metadata: “crosswalking”
        • HTML screenscraping, Excel spreadsheets
        • Issues: authority control? granularity? accuracy?
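Crosswalking is easiest to see with a toy case. The sketch below assumes a spreadsheet row that has already been read into a Python dict; the column headings, the `CROSSWALK` table, and the `crosswalk` function are all hypothetical, and the target names (`title`, `creator`, `date`, `subject`) are Dublin Core element names. Anything the mapping does not cover is returned separately rather than dropped, since that is usually where the granularity and accuracy questions show up.

```python
# Hypothetical mapping: spreadsheet column heading -> Dublin Core element.
CROSSWALK = {
    "Title": "title",
    "Author": "creator",
    "Year": "date",
    "Keywords": "subject",
}


def crosswalk(row, mapping=CROSSWALK):
    """Convert one spreadsheet row into a Dublin Core-style record.

    Returns (record, unmapped): the converted fields, plus any columns
    the mapping does not cover so a human can review them.
    """
    record = {}
    unmapped = {}
    for column, raw in row.items():
        value = raw.strip()
        if not value:
            continue  # skip empty cells
        target = mapping.get(column)
        if target is None:
            unmapped[column] = value
        elif target == "subject":
            # Assumption: keywords are separated by semicolons in one cell.
            keywords = [k.strip() for k in value.split(";") if k.strip()]
            record.setdefault(target, []).extend(keywords)
        else:
            record.setdefault(target, []).append(value)
    return record, unmapped


# Invented example row.
row = {
    "Title": " An Example Report ",
    "Author": "Doe, Jane",
    "Year": "2010",
    "Keywords": "metadata; crosswalks",
    "Shelf": "B2",
}
record, unmapped = crosswalk(row)
```

Note what the sketch does not do: it has no authority control, so “Doe, Jane” and “Jane Doe” would remain two different creators.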
  18. Subject metadata, specifically
      • What is this thing about?
      • Plenty of variation in sources
        • Author’s keyword vs. indexer’s descriptor
        • Controlled vocabulary vs. free-form keywording
        • Community tagging/“folksonomy”
        • Mechanically-extracted keywords
      • All of this matters if you’re searching!
  19. Where does metadata live?
      • In XML files (or MARC files, or...)
      • In relational databases
      • In RDF “triple stores” (special databases)
      • In content objects (as with TEI)
      • Or some combination of the above!
        • E.g. DSpace: can accept metadata in an XML file; stores all metadata in relational database
      • Next trick: associating content with its metadata!
  20. What is done with metadata?
      • To search against it or use it to browse, you need to “index” it first.
        • Turn it inside-out: records containing terms --> list of terms and the records they appear in
        • It’s all more complicated: stemming, phrases, variant spellings, languages, stopwords, etc.
        • The hot new indexing software is Apache “Solr.” It underlies Blacklight (from UVa), which underlies Forward.
      • Full-text search works the same way!
        • Google’s index: MASSIVE database of words with the web pages they appear in.
        • Spider/crawler: program that follows links across the web and indexes page content
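The “turn it inside-out” step is an inverted index. The following is a minimal sketch of the idea, not how Solr or Google actually implement it: the record texts, the tiny stopword list, and the function names are all assumptions made for illustration, and it ignores stemming, phrases, and variant spellings entirely.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (real systems use longer, per-language lists).
STOPWORDS = {"a", "an", "and", "of", "the", "in", "at"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def build_index(records):
    """Map each term to the set of record IDs it appears in."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for term in tokenize(text):
            index[term].add(record_id)
    return index


def search(index, query):
    """Return IDs of records containing every (non-stopword) query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Invented records: ID -> text.
SAMPLE_RECORDS = {
    1: "Metadata for images",
    2: "Preservation metadata in practice",
    3: "Images of the archive",
}
sample_index = build_index(SAMPLE_RECORDS)
```

Looking up a term is now a dictionary access instead of a scan through every record, which is why searching against an index stays fast as the collection grows.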
  21. Relevance ranking
      • You have a bunch of words and the records or documents they appear in. How do you decide which records/pages to display first?
      • Traditionally in libraries: last-in-first-out. Awful.
      • Using document structure and metadata
        • If the word’s in a title, heading, or subject field, take it more seriously than if it’s just in ordinary text.
      • TF/IDF
        • Term frequency: how often the search term shows up in a given record/document
        • Inverse document frequency: how rare the search term is in the whole mass of records/documents.
  22. [Figure: TF/IDF relevance grid. Axes: TF (one record), High to Low; IDF (whole corpus), Rare term to Common term. Cell labels: “Super-relevant!”, “Record not ‘about’ this term”, “Overused word or stopword”, “Irrelevant”.]
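Term frequency and inverse document frequency, as described on the relevance-ranking slide, can be combined into a single score. The sketch below uses one common textbook formulation (term count divided by document length, times the natural log of corpus size over the number of documents containing the term); search engines use many variants, so treat the exact formula, the function name, and the toy corpus as assumptions.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Score one term for one document against a corpus of token lists."""
    if not doc_tokens:
        return 0.0
    # Term frequency: share of this document's tokens that are the term.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: how many documents contain the term at all.
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0
    # Inverse document frequency: rare terms get a larger weight.
    idf = math.log(len(corpus) / containing)
    return tf * idf


# Invented toy corpus, already tokenized.
docs = [
    ["the", "metadata", "for", "images"],
    ["images", "of", "the", "archive"],
    ["the", "metadata", "of", "metadata"],
]

score_the = tf_idf("the", docs[1], docs)          # appears in every document
score_archive = tf_idf("archive", docs[1], docs)  # appears in only one
score_images = tf_idf("images", docs[1], docs)    # appears in two of three
```

In this corpus “the” scores zero because it appears everywhere, while “archive”, which appears in a single document, outranks “images” for that document.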
  23. What other information can be used to gauge relevance?
      • People pointing
        • Google: PageRank, based on counting links to a document
        • Scholarly communication: many metrics based on later citation of articles
      • People choosing
        • Google also up-votes pages based on people clicking on them in search results.
      • Individual or social history of interests
        • Amazon, Netflix
      • Notice who’s doing this and who isn’t.
      • Serious question: what about privacy?
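The “people pointing” signal can be approximated by counting inbound links. To be clear, this sketch is a simplified in-link count and not PageRank itself, which also weights each link by the rank of the page it comes from. The link graph and the function name are invented for illustration.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count distinct pages linking to each page.

    `links` maps a page to the list of pages it links to. Self-links
    and repeated links from the same page are counted at most once.
    """
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


# Invented link graph: page -> pages it links to.
sample_links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "c", "d"],
}
link_counts = inbound_link_counts(sample_links)
top_page = link_counts.most_common(1)[0][0]
```

Citation counts in scholarly communication work the same way, with articles in place of pages and citations in place of links.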
  24. [No text transcribed for this slide]
  25. Search engine optimization
      • Making sure that your page turns up in searches for relevant terms.
      • Done maliciously, this amounts to spam. Google spends LOTS of effort despamming its index.
      • Clean markup helps. So does putting highly relevant terms in highly visible/important locations.
      • Also, don’t overload pages! Dilutes vocabulary.
  26. What else can you do with relevance information?
      • Point people to PEOPLE and SERVICES, not just search results!
      • Point people to context that will help them evaluate search results.
      • We know people just throw search terms at boxes. We might as well work with that.
      • This may well be the best work Forward is doing.
  27. A word about GIS
      • “Geographic Information Systems”
      • It’s metadata all the way down! Metadata about places.
      • Also a lot about how to represent and visualize that metadata.
      • And how to mash it up with other data.
      • Heavily based on relational-database technology.
      • HOT JOB MARKET. If you can get trained, do.
  28. Finding and using metadata standards
      • Nobody knows every metadata standard out there. I sure don’t.
      • But faced with a new standard, I may have to get up to speed fast.
      • I may even be making adoption decisions.
      • So here’s how I do it.
  29. Getting up to speed
      • Find its website. If it doesn’t have a website, you don’t want to use it.
        • Is the website current? Is there recent activity?
        • Is there a list of who’s using this standard?
      • Find a sample record.
        • How is this standard expressed? XML, RDF, what?
        • Does it pass a sniff test?
      • Find the documentation and community.
        • “Tag libraries” and “data dictionaries” especially helpful.
        • Primers, “getting started” documents also nice.
      • Look for tools.
        • Authoring/crosswalk tools (and programming libraries)
        • Validation tools