Metadata

Published in: Education

2 Comments
  • a good overview of metadata. thanks for sharing.
  • Lots of savvy advice for LIS types!

    via http://bibwild.wordpress.com/2010/12/06/on-search-ranking/ where Dorothea Salo says: 'I agree, and that’s why I modded my “intro to libtech” syllabus to introduce TF/IDF, page ranking, and so on into the equation'

    See 20-24 on indexing
    21-22 for TF/IDF

Transcript

  • 1. Metadata 25 October 2010
  • 2. Weekly reflection • What digital “stuff” do you have? Where do you put it? How do you organize it, if you do? How do you find it when you need it?
  • 3. Tool of the week: Self-efficacy • In the course of your career, you will have to do things you don’t entirely know how to do. • Technical and non-! • Without training, guidance, or clear instructions. • No, of course we don’t teach you everything in library school! • Learn to dive in despite imperfect knowledge. • Use your common sense. • Trust that those around you want you to succeed. • If you need to, research! Always be ready to learn. • Mentors are great... but they’re not babysitters. • Accept imperfection. • Please model these behaviors in my class!
  • 4. Tip of the week: Staying informed • Weblogs and newsfeeds are your friends. • If you are not reading at least a few librarian blogs, you are not staying informed. • Can’t hurt to pick up some journal TOCs too. • Blogs are faster than the published literature! And often written by the same people. • For (library) tech: • Librarian in Black • Planet Code4Lib • librarian.net • Lifehacker, Gizmodo, Engadget • Roy Tennant’s LJ columns
  • 5. What is metadata? • Heck, I dunno. I’m not sure that’s even a useful question. • This is one reason I’m not a library-school professor. Definitional pilpul bores me. • Operationally: when we collect stuff, we take notes on it so we can organize it, inventory it, find it later, etc. Those notes are metadata. • Is MARC metadata? Well, of course! • But many librarians don’t think about it that way.
  • 6. Why are there so many metadata standards? • Different things described • For an image, you want to know its bit depth and colorspace. This has no meaning for a finding aid. • Several targeted standards vastly easier to cope with than one supposedly universal standard. • Different purposes • More on this in a moment • Different provider and user communities • Level of detail/specificity • Wheel (or toothbrush) reinvention
  • 7. Metadata file formats • You can express metadata in an Excel spreadsheet, a MARC record, XML, RDF... • But some expressions are more readable, useful, and reusable than others! • Metadata librarians spend a lot of time fixing and transforming Other People’s Metadata, in as automated a fashion as possible. • Large majority of modern metadata standards expressed in XML. • Though RDF wants to be a contender, and XML is only one way of several to express RDF.
  • 8. So what’s this RDF thing all the cool kids are talking about? • Resource Description Framework • by the W3C • Like XML, RDF is more or less friendly to whatever kind of metadata you want to throw at it. • Unlike XML, RDF is a data model designed for integrating information from different metadata vocabularies, and expressing how items and metadata records relate to one another. Links and linking! • (Also, XML works for content, e.g. TEI. RDF doesn’t.)
  • 9. (very) Basic RDF • “Triple:” subject, property, value • A little like subject, verb, object in English. • Dorothea Salo is the author of “Innkeeper at the Roach Motel.” • Subject: either me or the article (works either way, depending on property chosen) • Property: authorship (“isAuthorOf” or “isBy”); often comes from a controlled vocabulary like Dublin Core • Value: either the article or me, depending • One annoying thing: URIs as identifiers • What is my URI? Or the article’s (several versions)? • Several other annoying things about RDF, but they’re super-nerdy.
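A triple is small enough to sketch in plain Python. This is only an illustration of the subject/property/value shape, not real RDF tooling: the `example.org` URIs are made-up placeholders (the slide itself asks what the real ones would be), and `Triple` and `values_for` are hypothetical names.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    prop: str    # the property, e.g. "isAuthorOf"
    value: str


# Placeholder URIs, invented for this sketch only.
PERSON = "http://example.org/people/dorothea-salo"
ARTICLE = "http://example.org/articles/innkeeper-at-the-roach-motel"

# The same fact, stated in either direction depending on the property chosen.
store = [
    Triple(PERSON, "isAuthorOf", ARTICLE),
    Triple(ARTICLE, "isBy", PERSON),
]


def values_for(triples, subject, prop):
    """Return every value recorded for one subject/property pair."""
    return [t.value for t in triples if t.subject == subject and t.prop == prop]


print(values_for(store, PERSON, "isAuthorOf"))
```

Because both ends of a triple are identifiers rather than free text, a second dataset that uses the same URIs can be merged simply by concatenating the lists.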
  • 10. Linked data • As the web linked documents and people, it’s now time (say some) to link data. • Not a simple proposition! • RDF is hard. Calling it linked data doesn’t make it easier. • Data modeling is hard. • Data integration is hard. RDF makes it easier... up to a point. Still HUGE problems around people using the same term differently, other unexamined assumptions. • Idea gaining traction among governments, other big data providers. • So we probably need to keep our eye on it. • ALWAYS a good idea to think about how other people might use your metadata.
  • 11. Kinds of metadata • Descriptive (“bibliographic”) • Who made this? When? Where? What’s it about? Etc. • Technical • What is this? What is its format? What made it? Etc. • Administrative • Who owns this? Who’s changed it? Who has what IP rights over it? Who can see it? Etc. • Structural • How is this thing put together? • In practice, the landscape is muddier. • Most standards have bits of two or more types. • Also, “relationship” metadata coming to the fore.
  • 12. Descriptive metadata: MODS • Metadata Object Description Schema • Maintained by Library of Congress • Stripped-down, human-readable MARC in XML • http://www.loc.gov/standards/mods/ • Sample: http://www.loc.gov/standards/mods/v3/mods99042030.xml
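To give a feel for "human-readable MARC in XML," here is a minimal sketch that reads a title and a name out of a MODS-style record with Python's standard library. The record is hand-written and heavily abbreviated for illustration; it is not the Library of Congress sample, and the `describe` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# MODS version 3 namespace
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hand-written, abbreviated record for illustration only.
RECORD = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Innkeeper at the Roach Motel</title></titleInfo>
  <name type="personal"><namePart>Salo, Dorothea</namePart></name>
</mods>"""


def describe(xml_text):
    """Pull the title and all name parts out of one MODS record."""
    root = ET.fromstring(xml_text)
    title = root.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS)
    names = [n.text for n in root.findall("mods:name/mods:namePart", MODS_NS)]
    return {"title": title, "names": names}


print(describe(RECORD))
```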
  • 13. Technical metadata: MIX • Metadata for Images in XML • By Library of Congress, NISO • Captures information about an image’s file format and other technical characteristics • Why? Think about file-format obsolescence. • http://www.loc.gov/standards/mix/ • Sample document: http://www.loc.gov/standards/mix/instances/test_mix10.xml
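Technical metadata of this kind can usually be read straight out of the file. As a rough sketch (it has nothing to do with how MIX tools actually work), the code below reads width, height, bit depth, and color type from the header of a PNG. The function names are hypothetical, and `fake_png_header` builds only the first chunk of a file so the example can run without a real image.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
COLOR_TYPES = {
    0: "grayscale",
    2: "truecolor (RGB)",
    3: "indexed",
    4: "grayscale with alpha",
    6: "truecolor with alpha",
}


def png_technical_metadata(data):
    """Read basic technical metadata from a PNG's leading IHDR chunk."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("missing IHDR chunk")
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "format": "PNG",
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": COLOR_TYPES.get(color_type, "unknown"),
    }


def fake_png_header(width, height, bit_depth, color_type):
    """Build just the signature and IHDR chunk: enough for the reader above."""
    body = b"IHDR" + struct.pack(">IIBBBBB", width, height, bit_depth, color_type, 0, 0, 0)
    return PNG_SIGNATURE + struct.pack(">I", 13) + body + struct.pack(">I", zlib.crc32(body))


print(png_technical_metadata(fake_png_header(640, 480, 8, 2)))
```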
  • 14. Administrative metadata: PREMIS • PREservation Metadata: Implementation Strategies • who comes up with these acronyms? • Library of Congress, again (it hosts the PREMIS Maintenance Activity) • Designed to track digital preservation activity across an object’s lifecycle • http://www.loc.gov/standards/premis/ • Samples: look in http://www.dlib.org/dlib/september08/dappert/09dappert.html • But be aware that PREMIS is usually embedded in other metadata, like METS.
  • 15. Structural metadata: METS • Metadata Encoding and Transmission Standard • By... guess who? • Wrapper for other kinds of metadata; delineates the structure of a complex digital object • http://www.loc.gov/standards/mets/ • Samples: http://www.loc.gov/standards/mets/mets-examples.html
  • 16. Metadata spaghetti: TEI • Text Encoding Initiative • by the TEI Consortium • For digital transcriptions of books, manuscripts, dictionaries, etc. etc. • Content standard, not metadata standard! But contains its own “metadata header” • This header sometimes reused in other contexts • Moral: Sometimes content “embeds” metadata. • This is OK, but should every content standard roll its own internal metadata?
  • 17. Where does metadata come from? • Human data entry • Slow, expensive, error-prone • Often semi-automatable (80/20 point) • If you can automate, DO IT. Do not waste keystrokes! • Auto-extracting from a content object • Common for technical metadata • Auto-capture by preservation system • Common for some administrative metadata • Grabbing from elsewhere • From other metadata: “crosswalking” • HTML screenscraping, Excel spreadsheets • Issues: authority control? granularity? accuracy?
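Crosswalking is, at its simplest, a field-renaming table plus a record of what did not fit. The sketch below assumes a hypothetical spreadsheet with columns `Title`, `Author`, `Year`, `Topic`, and `Shelf`, and maps them onto Dublin Core element names; the mapping itself is an illustrative guess, not a published crosswalk.

```python
# Hypothetical local column names mapped to Dublin Core elements.
CROSSWALK = {
    "Title": "title",
    "Author": "creator",
    "Year": "date",
    "Topic": "subject",
}


def crosswalk(row, mapping=CROSSWALK):
    """Translate one record; return (mapped fields, fields with no target)."""
    mapped, unmapped = {}, {}
    for field, value in row.items():
        value = value.strip()
        if not value:
            continue  # skip empty cells rather than emit empty elements
        if field in mapping:
            # Dublin Core elements are repeatable, so collect values in lists.
            mapped.setdefault(mapping[field], []).append(value)
        else:
            unmapped[field] = value
    return mapped, unmapped


row = {
    "Title": "Innkeeper at the Roach Motel ",
    "Author": "Salo, Dorothea",
    "Year": "",
    "Shelf": "B2",
}
print(crosswalk(row))
```

Returning the unmapped fields instead of silently dropping them is one way to surface the granularity and accuracy issues the slide mentions.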
  • 18. Subject metadata, specifically • What is this thing about? • Plenty of variation in sources • Author’s keyword vs. indexer’s descriptor • Controlled vocabulary vs. free-form keywording • Community tagging/“folksonomy” • Mechanically-extracted keywords • All of this matters if you’re searching!
  • 19. Where does metadata live? • In XML files (or MARC files, or...) • In relational databases • In RDF “triple stores” (special databases) • In content objects (as with TEI) • Or some combination of the above! • E.g. DSpace: can accept metadata in an XML file; stores all metadata in relational database • Next trick: associating content with its metadata!
  • 20. What is done with metadata? • To search against it or use it to browse, you need to “index” it first. • Turn it inside-out: records containing terms --> list of terms and the records they appear in • It’s all more complicated: stemming, phrases, variant spellings, languages, stopwords, etc. • The hot new indexing software is “Solr,” an Apache project. Underlies Blacklight (from UVa), which underlies Forward. • Full-text search works the same way! • Google’s index: MASSIVE database of words with the web pages they appear in. • Spider/crawler: program that follows links across the web and indexes page content
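The "turn it inside-out" step can be sketched in a few lines. This is a toy inverted index, not how Solr is implemented: the stopword list is a tiny illustrative one, there is no stemming, and `tokenize`, `build_index`, and `search` are hypothetical names. The records are standard names taken from the earlier slides.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "at", "for", "in", "of", "the", "to"}  # tiny sample list


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def build_index(records):
    """Map each term to the set of record IDs it appears in."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for term in tokenize(text):
            index[term].add(record_id)
    return index


def search(index, query):
    """Return the IDs of records containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


RECORDS = {
    1: "Metadata Object Description Schema",
    2: "Metadata for Images in XML",
    3: "Metadata Encoding and Transmission Standard",
    4: "Text Encoding Initiative",
}

index = build_index(RECORDS)
print(sorted(search(index, "metadata encoding")))
```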
  • 21. Relevance ranking • You have a bunch of words and the records or documents they appear in. How do you decide which records/pages to display first? • Traditionally in libraries: last-in-first-out. Awful. • Using document structure and metadata • If the word’s in a title, heading, or subject field, take it more seriously than if it’s just in ordinary text. • TF/IDF • Term frequency: how often the search term shows up in a given record/document • Inverse document frequency: how rare the search term is in the whole mass of records/documents.
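  • 22. [Figure: TF/IDF grid. Axes: TF (one record), high vs. low; IDF (whole corpus), rare term vs. common term. High TF + rare term: super-relevant! Low TF + rare term: record not “about” this term. High TF + common term: overused word or stopword. Low TF + common term: irrelevant.]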
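A worked version of TF/IDF, as a minimal sketch. It uses one common textbook formulation (term count divided by record length, times the log of corpus size over the number of records containing the term); real search engines use various refinements. The three-record corpus is invented for the example.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Score one term for one record against the whole corpus."""
    if not doc_tokens:
        return 0.0
    # Term frequency: how much of this record the term accounts for.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Inverse document frequency: rarer across the corpus means a bigger weight.
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0
    idf = math.log(len(corpus) / containing)
    return tf * idf


# Invented toy corpus, already tokenized.
CORPUS = [
    "metadata about images".split(),
    "metadata about text".split(),
    "metadata metadata everywhere".split(),
]

for term in ("metadata", "images"):
    print(term, [round(tf_idf(term, doc, CORPUS), 3) for doc in CORPUS])
```

"metadata" appears in every record, so its IDF is log(1) = 0 and it scores zero everywhere no matter how often it repeats; "images" appears in only one record and scores above zero there.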
  • 23. What other information can be used to gauge relevance? • People pointing • Google: PageRank, based on counting links to a document • Scholarly communication: many metrics based on later citation of articles • People choosing • Google also up-votes pages based on people clicking on them in search results. • Individual or social history of interests • Amazon, Netflix • Notice who’s doing this and who isn’t. • Serious question: what about privacy?
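The "people pointing" idea, reduced to its barest form: count how many other pages link to each page and sort by that count. This is only the link-counting intuition from the slide; the real PageRank algorithm goes further and weights each link by the rank of the page it comes from. The link graph and function names are hypothetical.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a.html": ["c.html"],
    "b.html": ["c.html", "a.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}


def inbound_counts(links):
    """Count how many distinct other pages link to each page."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):  # a page's repeated links count once
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


def rank_by_links(links):
    """Order pages by inbound link count, most-linked first, ties by name."""
    counts = inbound_counts(links)
    pages = set(links) | set(counts)
    return sorted(pages, key=lambda page: (-counts[page], page))


print(rank_by_links(LINKS))
```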
  • 24. http://xkcd.com/522
  • 25. Search engine optimization • Making sure that your page turns up in searches for relevant terms. • Done maliciously, this amounts to spam. Google spends LOTS of effort despamming its index. • Clean markup helps. So does putting highly relevant terms in highly visible/important locations. • Also, don’t overload pages! Dilutes vocabulary.
  • 26. What else can you do with relevance information? • Point people to PEOPLE and SERVICES, not just search results! • Point people to context that will help them evaluate search results. • We know people just throw search terms at boxes. We might as well work with that. • This may well be the best work Forward is doing.
  • 27. A word about GIS • “Geographic Information Systems” • It’s metadata all the way down! Metadata about places. • Also a lot about how to represent and visualize that metadata. • And how to mash it up with other data. • Heavily based on relational-database technology. • HOT JOB MARKET. If you can get trained, do.
  • 28. Finding and using metadata standards • Nobody knows every metadata standard out there. I sure don’t. • But faced with a new standard, I may have to get up to speed fast. • I may even be making adoption decisions. • So here’s how I do it.
  • 29. Getting up to speed • Find its website. If it doesn’t have a website, you don’t want to use it. • Is the website current? Is there recent activity? • Is there a list of who’s using this standard? • Find a sample record. • How is this standard expressed? XML, RDF, what? • Does it pass a sniff test? • Find the documentation and community. • “Tag libraries” and “data dictionaries” especially helpful. • Primers, “getting started” documents also nice. • Look for tools. • Authoring/crosswalk tools (and programming libraries) • Validation tools