Presentation to AIIM First Canadian Chapter on April 29, 2015. Concepts for better understanding of metadata, controlled vocabularies, and taxonomies for enterprise content.
7. METADATA DEFINED
▪ Coined in the 1960s by Jack Myers
▪ Data about Data
▪ Stuff about Stuff
▪ Essential properties stored within the content or external to the content that identify and define context, history, and management of the content
9. APPLICATION OF METADATA
▪ Metadata is
▪ applied to all structured and unstructured content in a corpus
▪ visible to the user or hidden from view
▪ both machine-driven and manually entered
▪ internal or external to the content
▪ mandatory, optional, or conditional
10. MANY FORMS OF METADATA
▪ Corporate metadata is structured data about content
▪ Metadata is relational or hierarchical
▪ Metadata may take the form of
▪ Rich-text or binary
▪ Plain-text
▪ Controlled values/pick-lists/lookup values
▪ Syntax encoded values
▪ date/time (e.g., yyyy-mm-dd hh:mm:ss)
▪ financial ($0.00, -$0.00)
▪ numeric - integer/floating values (#,###)
▪ boolean (true/false)
▪ special (phone numbers, postal codes, or social insurance numbers)
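The syntax-encoded forms listed above can be checked mechanically. The following is a minimal sketch, assuming Python; the function names and the Canadian postal code pattern are illustrative assumptions, not part of the presentation.

```python
import re
from datetime import datetime


def is_datetime(value: str) -> bool:
    """Accept values shaped like yyyy-mm-dd hh:mm:ss."""
    try:
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False


def is_financial(value: str) -> bool:
    """Accept values shaped like $0.00 or -$0.00."""
    return re.fullmatch(r"-?\$\d+\.\d{2}", value) is not None


def is_boolean(value: str) -> bool:
    """Accept only the literal strings true and false."""
    return value in ("true", "false")


# Assumed pattern: a Canadian postal code such as "M5V 3K2"
# (letter-digit-letter, optional space, digit-letter-digit).
POSTAL_CODE = re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d")


def is_postal_code(value: str) -> bool:
    return POSTAL_CODE.fullmatch(value) is not None
```

A metadata form could run the matching check on each field before saving, so that malformed values never reach the repository.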
Metadata
11. MANY ROLES OF METADATA
▪ The primary role of metadata is to facilitate the identification, retrieval, and processing of content in any media.
▪ Secondarily, metadata may also
▪ appear as content to the content consumer, and
▪ serve as corporate structured data for analysis and business intelligence.
13. METADATA ISN’T THE MESSAGE
▪ Twitter post (118 chars)
▪ Twitter status message metadata (1,938 chars)
{"id"=>12296272736,
"text"=>
"An early look at Annotations:
http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453",
"created_at"=>"Fri Apr 16 17:55:46 +0000 2010",
"in_reply_to_user_id"=>nil,
"in_reply_to_screen_name"=>nil,
"in_reply_to_status_id"=>nil,
"favorited"=>false,
"truncated"=>false,
"user"=>
{"id"=>6253282,
"screen_name"=>"twitterapi",
"name"=>"Twitter API",
"description"=>
"The Real Twitter API. I tweet about API changes, service issues and happily answer questions about
Twitter and our API. Don't get an answer? It's on my website.",
"url"=>"http://apiwiki.twitter.com",
"location"=>"San Francisco, CA",
"profile_background_color"=>"c1dfee",
"profile_background_image_url"=>
"http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",
"profile_background_tile"=>false,
"profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",
"profile_link_color"=>"0000ff",
"profile_sidebar_border_color"=>"87bc44",
"profile_sidebar_fill_color"=>"e0ff92",
"profile_text_color"=>"000000",
"created_at"=>"Wed May 23 06:01:13 +0000 2007",
"contributors_enabled"=>true,
"favourites_count"=>1,
"statuses_count"=>1628,
"friends_count"=>13,
"time_zone"=>"Pacific Time (US & Canada)",
"utc_offset"=>-28800,
"lang"=>"en",
"protected"=>false,
"followers_count"=>100581,
"geo_enabled"=>true,
"notifications"=>false,
"following"=>true,
"verified"=>true},
"contributors"=>[3191321],
"geo"=>nil,
"coordinates"=>nil,
"place"=>
{"id"=>"2b6ff8c22edd9576",
"url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22ed9576.json",
"name"=>"SoMa",
"full_name"=>"SoMa, San Francisco",
"place_type"=>"neighborhood",
"country_code"=>"US",
"country"=>"The United States of America",
"bounding_box"=>
{"coordinates"=>
[[[-122.42284884, 37.76893497],
[-122.3964, 37.76893497],
[-122.3964, 37.78752897],
[-122.42284884, 37.78752897]]],
"type"=>"Polygon"}},
"source"=> "web"}
An early look at Annotations:
http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453
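The two character counts on this slide make the point numerically. As a rough sketch in Python, using only the counts quoted on the slide (the variable names are illustrative):

```python
# Character counts as quoted on the slide.
post_chars = 118        # the Twitter post itself
metadata_chars = 1938   # the status message metadata

# How many characters of metadata accompany each character of message.
ratio = metadata_chars / post_chars
print(f"Metadata is about {ratio:.1f} times the size of the post")
```

The message is a small fraction of what is actually transmitted and stored.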
14. WHY METADATA MATTERS
Collection and use of metadata has been known to be controversial when viewed out of context of the content it describes.
Importance of Metadata (Electronic Frontier Foundation, 30 December 2013)
▪ They know you rang a phone sex service at 2:24 am and spoke for 18 minutes. But they don’t know what you talked about.
▪ They know you called the suicide prevention hotline from the Golden Gate Bridge. But the topic of the call remains a secret.
▪ They know you spoke with an HIV testing service, then your doctor, then your health insurance company in the same hour. But they don’t know what was discussed.
17. THINKING ABOUT CLASSIFICATION
▪ Classification is the ordering of entities (things or concepts) into groups or classes on the basis of their similarity
▪ an activity that we do every day
▪ metadata and controlled vocabularies are tools that can be used for classification
18. THINKING ABOUT CLASSIFICATION
How many words can you memorize in 20 seconds?
analyst brake market stapler
seat traders alternator investor
calculators scissors engine pedal
dashboard pen backers marker
tape profit starter ruler prospects
19. THINKING ABOUT CLASSIFICATION
1. Filter out all of the noise
analyst brake market stapler
dashboard pen backer marker
seat trader alternator investor
pedal calculator scissors engine
tape profit starter ruler prospect
21. THINKING ABOUT CLASSIFICATION
3. Organize words by similarities
dashboard alternator pedal brake seat engine starter
marker stapler scissors tape pen calculator ruler
analyst market backer investor trader profit prospect
22. THINKING ABOUT CLASSIFICATION
4. Classify and label groups
Car parts: dashboard alternator pedal brake seat engine starter
Office supplies: marker stapler scissors tape pen calculator ruler
Stock market: analyst market backer investor trader profit prospect
23. THINKING ABOUT CLASSIFICATION
Stock market    Office supplies    Car parts
analyst         stapler            brake
market          calculator         seat
trader          scissors           dashboard
investor        pen                engine
backer          marker             alternator
profit          tape               starter
prospect        ruler              pedal
How well did you do?
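The exercise above can be expressed as a small program. This is a minimal sketch in Python, assuming the three class labels and word assignments from the slide; the function name `classify` and the "Unclassified" fallback are illustrative.

```python
from collections import defaultdict

# Class assignments taken from the three columns on the slide.
CLASS_OF = {
    "analyst": "Stock market", "market": "Stock market", "trader": "Stock market",
    "investor": "Stock market", "backer": "Stock market", "profit": "Stock market",
    "prospect": "Stock market",
    "stapler": "Office supplies", "calculator": "Office supplies",
    "scissors": "Office supplies", "pen": "Office supplies",
    "marker": "Office supplies", "tape": "Office supplies", "ruler": "Office supplies",
    "brake": "Car parts", "seat": "Car parts", "dashboard": "Car parts",
    "engine": "Car parts", "alternator": "Car parts", "starter": "Car parts",
    "pedal": "Car parts",
}


def classify(words):
    """Group words under their class label; unknown words are set aside."""
    groups = defaultdict(list)
    for word in words:
        groups[CLASS_OF.get(word, "Unclassified")].append(word)
    return dict(groups)


scrambled = ["analyst", "brake", "market", "stapler", "seat", "trader"]
groups = classify(scrambled)
```

Once each word carries a class label, the scrambled list can be regrouped in one pass, which is the same work the audience does mentally.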
24. THINKING ABOUT CLASSIFICATION
Vegetables    Computer parts    Instruments
peas          hard drive        violin
endive        sound card        harp
carrots       monitor           piano
spinach       mouse             trumpet
celery        processor         cello
broccoli      flash drive       flute
tomato        keyboard          guitar
Now how many words can you memorize in 20 seconds?
25. CONTROLLED VOCABULARIES
▪ Some metadata requires a classification or a controlled list of values or terms to define it, for example:
▪ Film rating: G, PG, 14A, 18A, R, A
▪ eBay seller location:
▪ Control is exercised over modifications to the list
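A controlled list is typically enforced at the point of data entry. The sketch below, in Python, uses the film ratings from the slide; the function name `validate_rating` is an assumption made for illustration.

```python
# The controlled list of film ratings from the slide. A frozenset cannot be
# modified in place, which mirrors the idea that changes to the list are controlled.
FILM_RATINGS = frozenset({"G", "PG", "14A", "18A", "R", "A"})


def validate_rating(value: str) -> str:
    """Return the value if it is in the controlled list, otherwise reject it."""
    if value not in FILM_RATINGS:
        allowed = ", ".join(sorted(FILM_RATINGS))
        raise ValueError(f"{value!r} is not a permitted film rating ({allowed})")
    return value
```

Free-text entries such as "PG-13" are rejected rather than silently stored, which keeps the metadata consistent for search and reporting.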
26. Controlled vocabularies defined
▪ A list of terms
▪ All terms in a controlled vocabulary must have an unambiguous, non-redundant definition. (Source: ANSI/NISO Z39.19-2005)
Controlled Vocabularies
What is a controlled vocabulary?
Why use controlled vocabularies?
Types of controlled vocabularies
27. BRIDGING BOUNDARIES - WHICH TERM IS “RIGHT”?
Accessible parking spaces
Accessible permit parking
Disabled permit parking
Designated disabled parking spaces
Handicapped parking
Disabled parking spaces
28. TOWARDS A COMMON VOCABULARY
Accessible parking spaces
Accessible permit parking
Disabled permit parking
Designated disabled parking spaces
Handicapped parking
Disabled parking spaces
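One way to move towards a common vocabulary is to map every variant to a single preferred term. The following Python sketch assumes "Accessible parking spaces" as the preferred term purely for illustration; the slides do not say which term was chosen, and the function name is hypothetical.

```python
# Assumption for illustration only: the slides do not name the preferred term.
PREFERRED = "Accessible parking spaces"

# Variant terms as listed on the slide.
VARIANTS = [
    "Accessible parking spaces",
    "Accessible permit parking",
    "Disabled permit parking",
    "Designated disabled parking spaces",
    "Handicapped parking",
    "Disabled parking spaces",
]

# Lookup table from each variant (case-insensitive) to the preferred term.
USE = {variant.lower(): PREFERRED for variant in VARIANTS}


def preferred_term(term: str) -> str:
    """Return the preferred term for a known variant, or the term unchanged."""
    return USE.get(term.strip().lower(), term)
```

Applied at indexing time and at search time, a mapping like this lets users find the same content whichever variant they type.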
31. TYPES OF CLASSIFICATION SCHEMES
▪ Subject
▪ Identifies content topics
▪ Organization Structure
▪ Depicts business units
▪ Functional
▪ Defined by business processes
32. SUBJECT TAXONOMIES
▪ Describes the topic of the resource
▪ Structured from broad to narrow / general to specific
▪ Often stable over time
34. ORGANIZATION CLASSIFICATION
▪ Shows business unit relationships
▪ Can be used to identify:
▪ Ownership of content
▪ Maintenance responsibilities
▪ A person’s place in the organization
▪ Changes frequently
36. FUNCTIONAL CLASSIFICATION
▪ Describes the breakdown of business processes
▪ Function – Activity – Task
▪ Stable in nature unless new processes or functions are introduced
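The Function – Activity – Task breakdown maps naturally onto a nested structure. This Python sketch uses hypothetical function, activity, and task names that do not come from the presentation.

```python
# Hypothetical functions, activities, and tasks; none come from the slides.
SCHEME = {
    "Financial management": {
        "Accounts payable": ["Invoice processing", "Payment issuance"],
        "Budgeting": ["Forecast preparation"],
    },
}


def paths(scheme):
    """Flatten the hierarchy into 'Function / Activity / Task' classification paths."""
    return [
        f"{function} / {activity} / {task}"
        for function, activities in scheme.items()
        for activity, tasks in activities.items()
        for task in tasks
    ]
```

Each resulting path can be stored as a single metadata value on a record, and it stays valid for as long as the underlying process exists.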
Taxonomy
46. ADMINISTRATIVE METADATA
▪ Information about the metadata record itself – its creation,
modification, relationship to other records, etc.
▪ Audit trails may capture the date and time when a file’s title was changed.
▪ Common subsets of administrative metadata are:
▪ Rights Management: metadata that deals with intellectual property rights
▪ Preservation: information needed to archive / preserve a resource
Source: Understanding Metadata – NISO 2004
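The audit-trail bullet above can be illustrated with a small data structure. This is a sketch in Python under assumed names (`Record`, `AuditEntry`, `rename`); it is not a description of any particular records system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditEntry:
    """One administrative metadata entry describing a change to a record."""
    field_name: str
    old_value: str
    new_value: str
    changed_at: datetime


@dataclass
class Record:
    title: str
    audit_trail: list = field(default_factory=list)

    def rename(self, new_title: str) -> None:
        """Change the title and capture when the change happened."""
        entry = AuditEntry("title", self.title, new_title, datetime.now(timezone.utc))
        self.audit_trail.append(entry)
        self.title = new_title


record = Record(title="Draft budget")
record.rename("Approved budget")
```

The title itself is descriptive metadata; the entry recording who or what changed it, and when, is administrative metadata about that record.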
49. ABOUT STRUCTURAL METADATA
▪ Describe the structure of a resource
▪ Book
▪ Document
▪ Website
▪ Table of contents
▪ Site map
▪ Internal structure
50. WHAT IS XML?
▪ XML (eXtensible Markup Language) is an open standard for the exchange of information
▪ first published in 1996 by the W3C (as a working draft; it became a W3C Recommendation in 1998)
▪ to encode electronic documents readable by
▪ humans, and
▪ machines
▪ for a multitude of applications ranging from
▪ corporate financial reporting applications, to
▪ Microsoft Word
52. WHAT ARE MARKUP LANGUAGES?
▪ pre-date desktop publishing and the Internet
▪ tell computers how to handle data
▪ such as how to render electronic content on a page
▪ categorized as either
▪ presentation, or
▪ semantic markup
53. PRESENTATION MARKUP
▪ With electronic presentation markup, we mark up the paragraph and italicize the citation for publication
▪ This is typical of web pages using Hypertext Markup Language (HTML)
The Cancer Journal: The Journal of Principles & Practice of
Oncology provides an integrated view of modern oncology across
all disciplines.
<p><i>The Cancer Journal: The Journal of Principles & Practice
of Oncology</i> provides an integrated view of modern oncology
across <i>all</i> disciplines.</p>
The Cancer Journal: The Journal of Principles & Practice of Oncology provides an
integrated view of modern oncology across all disciplines.
54. SEMANTIC MARKUP
▪ With semantic markup, we mark up the content to describe the meaning of the text
▪ Publishing stylesheets interpret the meaning from the markup and apply appropriate styles specific to the publishing context
The Cancer Journal: The Journal of Principles & Practice of
Oncology provides an integrated view of modern oncology across
all disciplines.
<intro><cite>The Cancer Journal: The Journal of Principles &amp;
Practice of Oncology</cite> provides an integrated view of
modern oncology across <em>all</em> disciplines.</intro>
The Cancer Journal: The Journal of Principles & Practice of Oncology provides an
integrated view of modern oncology across all disciplines.
The Cancer Journal: The Journal of Principles & Practice of Oncology provides an
integrated view of modern oncology across all disciplines.
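How one semantically marked-up source can yield different renderings is sketched below in Python. The two style tables (`WEB` and `PLAIN`) and the `render` function are hypothetical stand-ins for publishing stylesheets; the ampersand is escaped as `&amp;` so that the fragment is well-formed XML.

```python
import html
import xml.etree.ElementTree as ET

SOURCE = (
    "<intro><cite>The Cancer Journal: The Journal of Principles &amp; "
    "Practice of Oncology</cite> provides an integrated view of "
    "modern oncology across <em>all</em> disciplines.</intro>"
)

# Hypothetical "stylesheets": each maps a semantic tag to opening and closing styling.
WEB = {"cite": ("<i>", "</i>"), "em": ("<b>", "</b>")}
PLAIN = {"cite": ('"', '"'), "em": ("*", "*")}


def render(element, styles, escape=str):
    """Walk the element tree and wrap each tag's text in its context-specific style."""
    open_, close = styles.get(element.tag, ("", ""))
    parts = [open_, escape(element.text or "")]
    for child in element:
        parts.append(render(child, styles, escape))
        parts.append(escape(child.tail or ""))
    parts.append(close)
    return "".join(parts)


intro = ET.fromstring(SOURCE)
web = render(intro, WEB, escape=html.escape)  # HTML output, text re-escaped
plain = render(intro, PLAIN)                  # plain-text output
```

The source never says "italic" or "bold". Each publishing context decides how a citation or an emphasized word should look.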
55. SEMANTIC MARKUP
▪ Using semantic markup, we can
▪ disambiguate content
▪ search based on meaning
▪ connect to other content, and
▪ reuse or substitute new text.
57. INTELLIGENT CONTENT
▪ Content that is
▪ not limited to one
▪ purpose
▪ technology, or
▪ output
▪ structurally rich and semantically aware, making it
▪ discoverable
▪ reusable
▪ reconfigurable, and
▪ adaptable.
59. Communicating the benefits
Demonstrating interoperability with business examples
Keywords: Fort York; children, soldier, history
Creator: Jose San Juan
Asset Credit: City of Toronto
Headline: A British soldier in historical red uniform salutes children at Fort York
60. Communicating the benefits
Demonstrate reuse with business examples
▪ Write Headline once using DAL or Adobe CS: “A British soldier in historical red uniform salutes children at Fort York”
▪ Reuse Headline during design, as alt-tag for screen readers (to comply with AODA)
▪ Reuse Headline to search for files in DAL
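The reuse described on this slide can be sketched in a few lines of Python. The dictionary stands in for an asset's metadata record, and the file name, `img_tag`, and `search` are hypothetical names used only for illustration.

```python
from html import escape

# Metadata values from the Fort York example; the dict layout is an assumption.
asset = {
    "Keywords": ["Fort York", "children", "soldier", "history"],
    "Creator": "Jose San Juan",
    "Asset Credit": "City of Toronto",
    "Headline": "A British soldier in historical red uniform salutes children at Fort York",
}


def img_tag(src: str, asset: dict) -> str:
    """Reuse the Headline as alt text for screen readers."""
    return f'<img src="{escape(src)}" alt="{escape(asset["Headline"])}">'


def search(assets: list, term: str) -> list:
    """Reuse the Headline as the field searched when looking for files."""
    return [a for a in assets if term.lower() in a["Headline"].lower()]
```

The Headline is written once and then serves both accessibility and retrieval without being retyped.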
62. “Let me tell you how dangerous it is to design a classification scheme. It’s very dangerous. I have suffered. People attribute all kinds of motives to you. Apart from that, if anything goes wrong, they will pounce upon you.”
– S. R. Ranganathan
63. Dublin Core Metadata Standard
International Press Telecommunications Council – Photo Metadata
Adobe XMP – Extensible Metadata Platform
Rules for Archival Description
64. DUBLIN CORE
▪ maintains a vocabulary of metadata properties and encoding schemes
▪ core set of 15 properties for use in describing resources:
Contributor
Coverage
Creator
Date
Description
Format
Identifier
Language
Publisher
Relation
Rights
Source
Subject
Title
Type
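As a sketch of how the 15 properties might be used, the Python below builds a small XML record in the Dublin Core element namespace. The wrapper element name `record`, the function name, and the mapping of the Fort York Headline to `title` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 core properties, in the order listed on the slide.
DC_ELEMENTS = [
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
]


def dublin_core_record(values: dict) -> ET.Element:
    """Build a record containing only recognised Dublin Core properties."""
    unknown = set(values) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"Not Dublin Core properties: {sorted(unknown)}")
    root = ET.Element("record")  # hypothetical wrapper element
    for name in DC_ELEMENTS:
        if name in values:
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = values[name]
    return root


record = dublin_core_record({
    "title": "A British soldier in historical red uniform salutes children at Fort York",
    "creator": "Jose San Juan",
    "subject": "Fort York",
})
xml_text = ET.tostring(record, encoding="unicode")
```

Because the property names are shared across systems, a record like this can be exchanged without either side needing to know the other's internal field names.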
65. ISO METADATA STANDARDS
▪ ISO 23081 – Metadata for Records
▪ Recommendations for metadata required to manage records
▪ Metadata about the record itself
▪ Metadata about the business rules or policies and mandates
▪ Metadata about the agents
▪ Metadata about business activities or processes
▪ Metadata about records management processes
66. ISO 2788 – DEVELOPMENT OF MONOLINGUAL THESAURI
• Latest edition published in 1986
• Media- and Language-Agnostic
• Applicable across both broad and narrow subject areas and describes how to deal with multiple domains
• Intended to ensure consistency of practice across different agencies
• Provides recommendations rather than mandatory instructions
• Outlines optional procedures for many special cases where a standard approach may not be applicable
Thesaurus