Presented by Bob Kasenchak of Access Innovations, Inc. at the 2014 Special Libraries Association (SLA) annual meeting in Vancouver, British Columbia on June 7, 2014.
An all-day version of Access Innovations' Taxonomy Fundamentals workshop, presented by Marjorie M.K. Hlava and Bob Kasenchak at the 2014 Special Libraries Association (SLA) annual meeting in Vancouver, British Columbia on June 7, 2014.
Opening presentation for Track 1 of the 2012 Taxonomy Boot Camp, October 16, 2012.
Presented by Marjorie M.K. Hlava of Access Innovations and Heather Hedden of Hedden Information Management.
On the uses and implementation of taxonomy on the Web, with a particular focus on the taxonomy as part of an enterprise information environment. Presented by Marjorie M.K. Hlava during Content Week 2005 in Miami, Florida.
Presentation given on March 12, 2013 by Marjorie M.K. Hlava of Access Innovations, Inc. as a webinar for the San Francisco chapter of the Special Libraries Association.
How to make your content users more productive using Access Innovations, Inc.'s Navtree and Machine Aided Indexer (M.A.I.™), parts of the Data Harmony® software suite.
Semantic search helps business people find answers to pressing questions by wading through oceans of information to find nuggets of meaningful information. In this presentation we’ll discuss how semantic search and content analysis technologies are starting to appear in the marketplace today. We’ll provide a recap of what semantic search is and what the key benefits are, then we’ll answer the following questions:
• Is semantic search a feature, an application, or enterprise system?
• How can I add semantic search to my existing work processes?
• Will I need to replace my existing content technologies?
• What will I need to do to prepare my content for semantic search?
• Is semantic search just for documents or can I search my data too?
• Can I use semantic search to find information on the internet and other public data sources?
• Are there standards to consider?
Improve your Searches, Get Trained up on Expernova!
Access the Best Experts Worldwide and Manage your Company's Networks thanks to Expernova.
Discover in this presentation helpful tips and examples on how to carry out more complex searches using the operators available with the solution.
Obtain even more relevant results!
Open science can contribute to AI trustworthiness. This talk is a categorization of scientific data platforms, and a framing of AI trustworthiness with pointers to open science contributions.
Should We Expect a Bang or a Whimper? Will Linked Data Revolutionize Scholar Authoring and Workflow Tools?
Jeff Baer, Senior Director of Product Management, Research Development Services, ProQuest
Information Extraction and Linked Data Cloud
Dhaval Thakker
In the media industry there is great emphasis on providing descriptive metadata to consumers as part of media assets. Information extraction (IE) is considered an important tool in the metadata generation process, and its performance largely depends on the knowledge base it utilizes. Advances in “Linked Data Cloud” research provide a great opportunity to generate such a knowledge base, one that benefits from the participation of a wider community. In this talk, I will discuss our experiences using the Linked Data Cloud in conjunction with a GATE-based IE system.
Funding For Research!
Carol Anne Meyer, @meyercarol, who is responsible for Business Development and Marketing at CrossRef, describes CrossRef's FundRef funder identification service, which correlates funding organizations with the scholarly articles and other documents that result from their research expenditures. The FundRef taxonomy allows researchers to choose from a controlled vocabulary of thousands of funder names when they submit papers for publication. FundRef Search and other tools help funders demonstrate and measure the impact of their activities. CrossRef member publishers participating in FundRef will be able to serve the author/researcher community by helping them meet their funder compliance and reporting requirements and by displaying funding information through the CrossMark service. Carol will also introduce CrossRef services that allow researchers and publishers to reduce the time and effort necessary to arrange permissions for text and data mining. She will also explain the relationship between these services and initiatives to increase public access to scholarly content.
The presentation discusses the following topics:
- What Is ORCID?
- Why Is ORCID Important?
- ORCID Features
- Create an ORCID Account
- ORCID Researcher Profile
Cape Town - Bioschemas workshop before the Bioinformatics Education Summit.
Explains schema.org, Bioschemas, TeSS Case study, and the tools and implementation techniques adopters can use
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
Life Science Database Cross Search and Metadata
Maori Ito
Life science databases are sometimes difficult to understand due to lack of information. I'd like to add metadata into databases and improve search results.
WOTS2E: A Search Engine for a Semantic Web of Things
Andreas Kamilaris
A Semantic Web of Things (SWoT) brings together the Semantic Web and the Web of Things (WoT), associating semantically annotated information with web-enabled physical devices, services, and their data, towards seamless data integration and better understanding of real-world information. A missing element needed to realize the SWoT is a standardized, scalable, and flexible way to globally discover web-connected embedded devices, as well as their semantic data, in (near) real time. To address this gap, we propose the WoT Semantic Search Engine (WOTS2E), a search engine for the SWoT based on web crawling that is able to discover Linked Data endpoints and, through them, WoT-enabled devices and their services. In this presentation, we describe the design, development, and implementation of WOTS2E, as well as an evaluation showing its operation and performance across the web.
Lesson 7 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/educaiton-modules. Released under a CC0 license; attribution and citation requested.
Royal Society of Chemistry activities to develop a data repository for chemis...
Ken Karapetyan
The Royal Society of Chemistry publishes many thousands of articles per year, the majority of them containing rich chemistry data that, in general, is limited in its value when isolated in the HTML or PDF form of the articles commonly consumed by readers. RSC also has an archive of over 300,000 articles containing rich chemistry data, especially in the form of chemicals, reactions, property data, and analytical spectra. RSC is developing a platform integrating these various forms of chemistry data. The data will be aggregated both during the manuscript deposition process and as the result of text mining and extraction of data from across the RSC archive. This presentation will report on the development of the platform, including our success in extracting compounds, reactions, and spectral data from articles. We will also discuss our developing process for handling data at manuscript deposition and the integration and support of electronic lab notebooks (ELNs) in terms of facilitating data deposition and sourcing data. Each of these processes is intended to ensure long-term access to research data, with the intention of facilitating improved discovery.
Real-time Recommendations for Retail: Architecture, Algorithms, and Design
Juliet Hougland
Users are constantly searching for new content and to stay competitive organizations must act immediately based on up-to-date data. Outdated recommendations decrease the likelihood of presenting the right offer and make it harder to maintain customer loyalty. In order to provide the most relevant recommendations and increase engagement, organizations must track customer interactions and re-score recommendations on the fly.
Data sources have expanded dramatically to include a wealth of historical data and a constant influx of behavior data. The key to moving from predictive models, applied in batch, to models that provide responses in real time, is to focus on the efficiency of model application. The speed that recommendations can be served is influenced by:
Architecture of the recommendation serving platform
Choice of recommendation algorithm
Datastore access patterns
In this presentation, we’ll discuss how developers can use open source components like HBase and Kiji to develop low-latency recommendation models that can be easily deployed by e-commerce companies. We will give practical advice on how to choose models and design data stores that make use of the architecture and quickly serve new recommendations.
The Research Data Alliance (RDA) has developed a Catalogue of Metadata Standards and tools aimed at researchers and those who support them. In its new version, the Metadata Standards Catalog will provide much greater detail about metadata standards and tools, and, through its new API, it will be usable within other applications. It will also provide a platform for furthering the work of the RDA Metadata Interest Group, which is seeking to improve the interoperability of metadata in different standards by working towards semi-automatically generated converters.
Data science is a multidisciplinary field that combines scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It involves analyzing, interpreting, and deriving actionable information from large and complex datasets to support decision-making and solve problems in various domains.
Key components of data science include:
Data Collection and Preparation: Data scientists gather and collect data from various sources, which may include databases, websites, sensors, social media, or other digital platforms. They clean, transform, and preprocess the data to ensure its quality and suitability for analysis.
Data Exploration and Visualization: Data scientists explore and visualize the data using statistical techniques and visualization tools. They look for patterns, trends, and relationships within the data to gain a deeper understanding of the underlying insights and potential correlations.
Machine Learning and Predictive Modeling: Data scientists apply machine learning algorithms and predictive modeling techniques to build models that can make predictions or classifications based on the available data. This involves training models on historical data and evaluating their performance on new or unseen data.
Statistical Analysis: Statistical analysis is a fundamental aspect of data science. Data scientists use statistical methods to analyze data, test hypotheses, identify significant variables, and quantify uncertainties to make informed decisions.
Data Interpretation and Communication: Data scientists interpret the results of their analysis and communicate their findings to stakeholders in a clear and meaningful way. They use data visualization techniques, storytelling, and data-driven insights to convey complex information and facilitate decision-making.
Domain Knowledge: Data scientists often work in specific domains or industries and require domain knowledge to understand the context and interpret the results effectively. This allows them to identify relevant variables, apply appropriate techniques, and generate actionable insights.
Data science has applications across various sectors, including finance, healthcare, marketing, retail, telecommunications, and more. It helps organizations gain a competitive advantage, optimize processes, identify trends, improve customer experiences, and drive data-informed decision-making.
To work in data science, proficiency in programming languages (such as Python or R), statistical knowledge, data manipulation skills, and experience with machine learning algorithms are typically required. Data scientists also need critical thinking, problem-solving abilities, and effective communication skills to effectively analyze data and communicate insights to both technical and non-technical stakeholders.
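As a toy illustration of the workflow described above, the sketch below walks through preparation, exploration, a trivial predictive "model", and evaluation using only the Python standard library. All data and names are invented; a real project would reach for pandas and scikit-learn.

```python
# Toy end-to-end data science workflow using only the standard library.
from statistics import mean

# 1. Collection and preparation: drop records with missing values.
raw = [(1.2, 0), (3.8, 1), (None, 1), (0.9, 0), (4.1, 1), (1.1, 0)]
clean = [(x, y) for x, y in raw if x is not None]

# 2. Exploration: compare the mean feature value per class.
mean_pos = mean(x for x, y in clean if y == 1)
mean_neg = mean(x for x, y in clean if y == 0)

# 3. "Model": classify by which class mean the value is closer to,
#    i.e. a single threshold halfway between the two means.
threshold = (mean_pos + mean_neg) / 2

def predict(x):
    return 1 if x > threshold else 0

# 4. Evaluation: accuracy on the (tiny) training set.
accuracy = sum(predict(x) == y for x, y in clean) / len(clean)
print(accuracy)
```

The same four steps scale up to any of the components listed above; only the tooling changes.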
Have you ever wondered how search works while visiting an e-commerce site, internal website, or searching through other types of online resources? Look no further than this informative session on the ways that taxonomies help end-users navigate the internet! Hear from taxonomists and other information professionals who have first-hand experience creating and working with taxonomies that aid in navigation, search, and discovery across a range of disciplines.
Making AI Behave: Using Knowledge Domains to Produce Useful, Trustworthy Results
Access Innovations, Inc.
In today's highly charged atmosphere of anxiety and anticipation about AI, and especially LLMs, one of the biggest concerns is how to ensure that these systems return accurate results (meaning both true and pertinent to their audience). This is particularly important to scholarly, scientific, and other technical organizations, whose constituents often work in very specific domains, such as medicine, engineering, history, biology, or chemistry. One extremely useful tool to incorporate in an AI-based process in such cases is a comprehensive and well-structured knowledge domain based on a controlled vocabulary.
Smart Submit and Client Support
Michael Millar, Junior Software Developer, and Frank Coates, Client Support Manager
Get a peek at the new and improved Smart Submit and learn about new, easier ways to contact the support team at Access Innovations.
How a Good Taxonomy Can Provide Valuable Business Insights
Kristen Monahan, Public Library of Science (PLOS)
Kristen is a business analyst, and she won't be talking about the PLOS taxonomy itself but rather how she uses that taxonomy to drill down into the massive amount of content, metadata, and usage and process data that PLOS generates, for deep, detailed analysis and to drive business decisions. Much of this work involves trend analysis. For example, trend analysis of submissions can look at the time it takes from submission to decision by subject (narrow subjects like Covid, broad subjects like biotechnology), by institution, by country, etc., to see not just the overall big picture but where in their submission and peer review workflows the bottlenecks might be. A trend analysis of topics over time can prompt them to issue a call for papers for a topic they think needs to be better covered, and then to look at both the short-term and long-term trends resulting from that call for papers. Their taxonomy doesn't just make their content smarter; it makes how they publish that content smarter too.
Editor and Peer Reviewer Assignments Using Data Harmony
Andrew Smeall, Hindawi Publishing
Andrew will show how Hindawi, an open access publisher, applies their taxonomy to make editor and reviewer assignments for incoming submissions to their journals.
Cloud Deployment of Data Harmony
Jeffrey Gordon, Lead Developer, Access Innovations, Inc.
Jeffrey will describe the cloud deployment of the Data Harmony software.
Marjorie M. K. Hlava, President, Chair of the Board, and Chief Scientist, Access Innovations, Inc.
During this annual highlight of the DHUG meetings, Margie will discuss the exciting new changes and additions to the Data Harmony software. She will be joined by some members of our software development team to talk about specific initiatives we have worked on over the past year.
Access Innovations and Atypon: Beyond Content Tagging
Hong Zhou and Gerasimos Razis, Atypon
Gerasimos and Hong will discuss the changes to the Atypon platform since DHUG 2020.
Getting to the Point: Using AI and Taxonomies to Craft Meta-Titles
Travis Hicks, American Society of Clinical Oncology (ASCO)
Looking to better leverage SEO and include key terms in the URL construct for research abstracts, ASCO is working with Access Innovations to evaluate how to programmatically create short titles for abstracts. The idea is to index titles against existing taxonomies as a way of producing a short title that succinctly identifies what an abstract is about, for purposes of constructing a new URL configuration. Travis will discuss the need, challenges, and early results of the project.
Expanding the Use of MAIstro at ASCE
Xi Van Fleet, American Society of Civil Engineers
Using MAIstro, ASCE created the subject/topic taxonomies for their publications to enhance content discovery and business insight. After achieving their primary goal, they have been expanding its use for other applications.
Lessons Learned From Building a Taxonomy and Indexing 140 Years of Content
Michael Darr, Project Manager, D33 – American Chemical Society Pubs IT
Michael will talk about the things they would do differently if they were to build a new taxonomy and index a legacy file, and the things they did right the first time.
Bill’s talk is entitled “WHAT’S IN A NAME? How Kew helps drug regulators disambiguate the messy welter of medicinal plant names to shore up regulation and save lives”. It’s really eye-opening to realize how complicated and imprecise names can get, with multiple scientific, pharmaceutical and popular names for the same thing or with one name used for completely different things.
This has real-world consequences. For example, the EU mistakenly banned a useful plant we use every day when intending to ban a poisonous one because of a naming problem. How Kew is using semantic and taxonomic tools and technologies to bring order to this complexity (I almost said chaos) is really fascinating. They’re also helping to disambiguate nomenclature and provide links to authoritative information for botanical terms for use in journal articles, among other things.
3. OUTLINE
• Data
• Structured Data
• Unstructured Data
• Metadata
• Subject Metadata
• Entity (author, institution) Metadata
• Document Type Metadata
• Automating Metadata
• Heuristic/Statistical/Inferential
• Rule-based
I Don’t Have Time for Metadata!
5. STRUCTURED VS. UNSTRUCTURED DATA
Present different problems – and possible solutions – for automatically adding metadata
6. STRUCTURED VS. UNSTRUCTURED DATA
Association, in view of abuses and lack of consistency in published reports, has asserted that the all-inclusive income statement, containing all income items recognized as determinants of net income, is the answer to these questions.2 The Securities and Exchange Commission has also strongly favored this solution.3
1 Committee on Accounting Procedure, American Institute of Accountants, "Income and Earned Surplus," Accounting Research Bulletin No. 32 (December, 1947). 2 (1) "A Tentative Statement of Accounting Principles Affecting Corporate Reports," THE ACCOUNTING REVIEW, June, 1936, pp. 187-191; (2) Accounting
7. STRUCTURED VS. UNSTRUCTURED DATA
<volume>325</volume>
<issue>5945</issue>
<fpage seq="c">1206</fpage>
<lpage>1206</lpage>
<history><date date-type="received"><day>26</day><month>02</month><year>2009</year></date>
<date date-type="accepted"><day>11</day><month>08</month><year>2009</year></date></history>
<permissions>
<copyright-statement>Copyright © 2009</copyright-statement>
<copyright-year>2009</copyright-year>
<copyright-holder>Your name here</copyright-holder>
</permissions>
<abstract>
<p>Our extended ontogenetic growth model is a theoretical model based on conservation
of energy and general biological mechanisms underlying ontogenetic growth. We do not
believe that the comments of Makarieva <italic>et al</italic>. and Sousa <italic>et al
</italic>. expose substantive problems with our model. Nevertheless, they raise
interesting, still unresolved questions and point to philosophical differences about the role
of theory and of simple, general models as opposed to complicated, specific models.</p>
</abstract>
8. STRUCTURED VS. UNSTRUCTURED DATA
• Just extracting basic information
• Author
• Institution
• Title
• Document type
• Accession number(s)
…can be a challenge.
However…
9. STRUCTURED VS. UNSTRUCTURED DATA
• Predictability
• Positionality
[Screenshot: a journal article page with labeled regions: journal name/issue/vol./etc., article title, copyright info, author info, abstract]
10. UNSTRUCTURED DATA => STRUCTURED DATA!
<journal>Transactions on Vehicular Technology</journal>
<article-title>Relationship of Average Transmitted and Received Energies in Adaptive Transmission</article-title>
<authors><author-surname>Kotelba</author-surname><author-firstname>Adrian</author-firstname><affiliation>Member, IEEE</affiliation></authors>
<copyright-info><copyright-date>2009</copyright-date></copyright-info>
<abstract><p>This paper studies the…</p></abstract>
NOTE: Some cleanup may be required
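The unstructured-to-structured step on the slide above can be sketched roughly as follows. The patterns and field names here are illustrative only, and, as the slide notes, real extractions usually need cleanup.

```python
# Minimal sketch: turn plain-text article front matter into a
# structured metadata record. The layout assumption (journal on the
# first line, title on the second, author on the third) is invented
# for illustration; production parsers are far more robust.
import re

text = """Transactions on Vehicular Technology
Relationship of Average Transmitted and Received Energies in Adaptive Transmission
Adrian Kotelba, Member, IEEE
Copyright 2009"""

lines = text.splitlines()
record = {
    "journal": lines[0],
    "article-title": lines[1],
    "author": lines[2].split(",")[0],          # drop the affiliation
    "copyright-date": re.search(r"\b(19|20)\d{2}\b", text).group(0),
}
print(record)
```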
11. STRUCTURED VS. UNSTRUCTURED DATA
• Basic information already tagged, labeled, and easy to extract
• Author info
• Title
• Journal/Volume/Issue etc.
• We can add semantic (or subject) metadata
• Targeting only those parts of the text we require
• Title
• Abstract
• Full text body
• Exclude references, etc.
12. SEMANTIC METADATA
Uncontrolled
Automatic keyword extraction
Crowdsourced/folksonomic tags
Controlled – from a Thesaurus (or Taxonomy…)
Inferential (Heuristic; Statistical)
Rule-based
13. SEMANTIC METADATA: HOW?
Controlled – from a Thesaurus (or Taxonomy…)
Inferential (Heuristic; Statistical)
Rule-based
Manual tagging
Automatic tagging
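A toy sketch of the rule-based, automatic tagging approach listed above: controlled terms from a thesaurus are assigned when their trigger phrases appear in the text. The two-term vocabulary and its rules are invented; production rule bases are far richer.

```python
# Each controlled term is triggered by one or more phrases.
rules = {
    "Electron microscopes": ["electron microscope", "electron microscopy"],
    "Microscopes": ["microscope"],
}

def tag(text):
    """Return the set of controlled terms whose triggers match."""
    text = text.lower()
    return {term for term, triggers in rules.items()
            if any(t in text for t in triggers)}

print(tag("Imaging the sample with an electron microscope"))
```

Note that naive substring matching assigns both the broad and the narrow term here; the specification slides later in the deck address exactly this problem.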
15. SEMANTIC METADATA: MANUAL ENTRY
A Thought Experiment
• Let’s say a manual indexer can index 10 records/hour
• Let’s say the manual indexers are perfectly consistent (they’re not)
• Let’s say your manual indexers are paid $10/hour (good luck with that)
If you have 10,000 articles/pieces of content:
It would take a manual indexer 1000 hours (25 weeks) and cost $10,000
If you have 100,000 articles:
It would take a manual indexer 10,000 hours (250 weeks, or almost 5 years) and cost $100,000
If you have 1,000,000 articles:
It would take a manual indexer 100,000 hours (~48 years) and $1,000,000
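Under the thought experiment's stated assumptions (10 records per hour, $10 per hour, a 40-hour work week), the arithmetic can be checked directly:

```python
RATE = 10   # records indexed per hour
WAGE = 10   # dollars per hour
WEEK = 40   # working hours per week

for articles in (10_000, 100_000, 1_000_000):
    hours = articles / RATE
    print(f"{articles:>9} articles: {hours:,.0f} hours, "
          f"{hours / WEEK:,.0f} weeks, ${hours * WAGE:,.0f}")
```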
17. SEMANTIC METADATA: WHY?
Disambiguate the ambiguous
Specify most specific topics
Improve information retrieval
Search
Browse
Enable advanced analytics
20. SEMANTIC METADATA: SPECIFICATION
Beyond exact string matches: Context. Matters.
Indexing to most specific term
- Microscopes
- Electron microscopes
- Scanning electron microscopes
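Indexing to the most specific term can be sketched as follows, using the microscope hierarchy above. The broader-term lookup table is an invented toy structure standing in for a real thesaurus.

```python
# Child term -> its broader term (None at the top of the hierarchy).
hierarchy = {
    "Scanning electron microscopes": "Electron microscopes",
    "Electron microscopes": "Microscopes",
    "Microscopes": None,
}

def most_specific(matched):
    """Drop any matched term that is an ancestor of another match."""
    broader = set()
    for term in matched:
        parent = hierarchy[term]
        while parent:
            broader.add(parent)
            parent = hierarchy[parent]
    return matched - broader

matches = {"Microscopes", "Electron microscopes", "Scanning electron microscopes"}
print(most_specific(matches))
```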
22. SEMANTIC METADATA: WHY?
Improving information retrieval: Search
Allows user to search by tags
Ensures consistent and reliable retrieval
Speeds electronic search
24. SEMANTIC METADATA: WHY?
Improving information retrieval: Search
[Screenshot: a metadata-based search, with results based on metadata]
25. SEMANTIC METADATA: WHY?
Improving information retrieval: Browse
[Screenshot: a taxonomy browse interface, with results based on metadata]
26. SEMANTIC METADATA: WHY?
Improving information retrieval: Browse
[Screenshot: a taxonomy browse interface with additional search filters]
27. SEMANTIC METADATA: WHY?
Improving information retrieval: Analytics
Combine subject metadata with metadata about
Authors
Institutions
Publications (Journals, Magazines, etc.)
Publication Types
…to create detailed informatics about your data, users,
authors, and whatever else is relevant or useful
28. SEMANTIC METADATA: WHY?
Improving information retrieval: Analytics
[Screenshot: a taxonomy term with its broader and narrower terms, and the authors who publish on this topic]
29. I DON’T HAVE TIME FOR METADATA!
Since metadata allows you to do things you already have to, want to, and need to do:
It's always time for metadata.