2. Metadata
• Metadata, literally “data about data,” has
become a widely used yet still frequently
underspecified term that is understood in
different ways by the diverse professional
communities that design, create, describe,
preserve, and use information systems and
resources.
3. Metadata
• For the past hundred years at least, the creation
and management of metadata has primarily been
the responsibility of information professionals
engaged in
– cataloging
– classification
– indexing
• But as information resources are increasingly put
online by the general public, metadata
considerations are no longer solely the province
of information professionals.
4. Metadata Creation
• Metadata creation is—or should often be—a
collaborative effort
• Until the mid-1990s, metadata was a term
used primarily by communities involved with
the management and interoperability of
geospatial data and with data management
and systems design and maintenance in
general
5. Metadata
• In general, all information objects, regardless
of the physical or intellectual form they take,
have three features:
– Content
– Context
– Structure
• All three can and should be reflected
through metadata.
6. Metadata
• Content relates to what the object contains or is
about and is intrinsic to an information object.
• Context indicates the who, what, why, where, and
how aspects associated with the object’s creation
and is extrinsic to an information object.
• Structure relates to the formal set of associations
within or among individual information objects and
can be intrinsic (basic, inherent, essential) or extrinsic
or both.
7. Metadata
• Cultural heritage information professionals such as
museum registrars, library catalogers, and archival
processors often apply the term metadata to the
value-added information that they create to arrange,
describe, track, and otherwise enhance access to
information objects and the physical collections related
to those objects.
• Such metadata is frequently governed by community-
developed and community-fostered standards and best
practices in order to ensure quality, consistency, and
interoperability (info exchange).
9. Metadata Authors
• Metadata created by users
• Metadata created by trained information
professionals.
• Activities such as
– social tagging
– social bookmarking
• The resulting forms of user-created metadata
such as “folksonomies”
10. Web Search and Ranking
• How do Web search engines work?
• How do they use metadata, data, links, and
relevance ranking to help users find what they
are seeking?
11. Importance of Metadata
• Hardware and software come and go, sometimes
becoming obsolete with alarming rapidity
• But high-quality, standards-based, system-independent
metadata can be
– used
– reused
– migrated
– disseminated (spread)
in any number of ways, even in ways that we cannot
anticipate at this moment
13. Information Resources
• Our users are the primary reason that we
create digital resources.
• Exercise: make a list of information resources and
related objects
14. 1. Photographs
2. Text Books
3. Journals
4. Research papers
5. Audios
6. Articles
7. Magazines
8. Videos
9. Publications
10. Media
11. Blogs
12. Websites
13. Encyclopedias
14. Expert opinions
15. Databases
16. Newspapers
17. Almanacs
18. Conference proceedings
19. Dictionaries
20. Encyclopedias
21. Handbooks
22. Diaries
23. Interviews
24. Letters
25. Original works of art
26. Speeches
27. Works of literature
28. Biographies
29. Dissertations
30. Indexes, abstracts, bibliographies (used to locate a secondary source)
31. Journal articles
32. Monographs
15. Twitter Metadata
• Twitter has the following objects
– Users
– Tweets
– Entities: media, urls, user_mentions,
hashtags, symbols
16. Users Metadata
• Users can be anyone or anything.
They tweet, follow, create lists, have a
home_timeline, can be mentioned, and can
be looked up in bulk
• Metadata for users contains the following
fields:
17. Users Metadata
• contributors_enabled
– Indicates that the user has an account with “contributor
mode” enabled, allowing Tweets issued by the user to be
co-authored by another account.
• created_at
– The UTC datetime that the user account was created on
Twitter.
• default_profile
– When true, indicates that the user has not altered the theme or
background of their user profile.
18. Users Metadata
default_profile_image
◦ When true, indicates that the user has not uploaded their
own avatar and a default egg avatar is used instead.
id
◦ The integer representation of the unique identifier for this
User.
id_str
◦ The string representation of the unique identifier for this
User.
20. Tweets Metadata
The text of a tweet is limited to 140 characters.
But every tweet also carries a good deal of metadata:
information about when it was sent, by whom, using what Twitter
application, and so on.
contributors
◦ A collection of brief user objects (usually only one) indicating
users who contributed to the authorship of the tweet
coordinates
◦ Represents the geographic location of this Tweet as reported by
the user or client application
created_at
◦ UTC time when this Tweet was created.
21. Tweets Metadata
current_user_retweet
◦ Only surfaces on methods supporting
the include_my_retweet parameter, when set to true.
entities
◦ Entities which have been parsed out of the text of the Tweet
id
◦ The integer representation of the unique identifier for this
Tweet.
id_str
◦ The string representation of the unique identifier for this
Tweet.
text
◦ The actual UTF-8 text of the status update
• …more
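The user and tweet fields listed on the preceding slides can be read from a JSON payload roughly as sketched below. The payload is invented for illustration and is not real Twitter output: only the field names come from the slides, while the nesting of the user object under a `user` key, the timestamp layout, and the helper name `summarize_tweet` are assumptions of this sketch.

```python
import json
from datetime import datetime

# Invented sample payload: field names follow the slides, values are made up.
# Nesting the user object under "user" is an assumption of this sketch.
SAMPLE_TWEET = """
{
  "created_at": "Wed Aug 27 13:08:45 +0000 2008",
  "id": 1050118621198921728,
  "id_str": "1050118621198921728",
  "text": "Metadata is data about data #metadata",
  "coordinates": null,
  "user": {
    "id": 12345,
    "id_str": "12345",
    "created_at": "Mon Jan 01 00:00:00 +0000 2007",
    "default_profile": true,
    "default_profile_image": false,
    "contributors_enabled": false
  }
}
"""

# Assumed layout of created_at (weekday, month, day, time, UTC offset, year).
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def summarize_tweet(raw: str) -> dict:
    """Pull a few metadata fields out of a tweet payload (hypothetical helper)."""
    tweet = json.loads(raw)
    created = datetime.strptime(tweet["created_at"], TIMESTAMP_FORMAT)
    return {
        "tweet_id": tweet["id_str"],          # string form of the identifier
        "created_utc": created.isoformat(),   # normalized timestamp
        "has_location": tweet["coordinates"] is not None,
        "author_id": tweet["user"]["id_str"],
        "text_length": len(tweet["text"]),
    }


summary = summarize_tweet(SAMPLE_TWEET)
print(summary)
```

The sketch reads the identifier from `id_str` rather than `id`; both carry the same value, and the string form avoids problems in environments that cannot represent large integers exactly.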
22. Places Metadata
• locality the city the place is in
• region the administrative region the place is in
• iso3 the country code
• postal_code in the preferred local format for the place
• phone in the preferred local format for the place,
include long distance code
• twitter Twitter screen name, without the @
• url official/canonical URL for place
• app:id An ID or comma separated list of IDs
representing the place in the applications
place database.
23. Entities for Tweets: media entity
An array of media attached to the Tweet with the Twitter
Photo Upload feature.
id the media ID (int format)
id_str the media ID (string format)
media_url The URL of the media file
media_url_https The SSL URL of the media file
url The media URL that was extracted
display_url Not a URL but a string to display instead of the media URL
expanded_url The fully resolved media URL
sizes thumb, small, medium and large.
type Only photo for now
indices The character positions the media was extracted from
24. urls entity
• An array of URLs extracted from the Tweet text. Each URL
entity comes with the following attributes:
url The t.co URL that was extracted from the Tweet text
display_url Not a valid URL but a string to display instead of the URL
expanded_url The resolved URL
indices The character positions the URL was extracted from
25. user_mentions entity
id The user ID (int format)
id_str The user ID (string format)
screen_name The user screen name
name The user full name
indices The character positions the user mention was extracted from
26. hashtags entity
text The hashtag text
indices The character positions the hashtag was extracted from
27. symbols entity
• An array of financial symbols starting with the
dollar sign extracted from the Tweet text.
text The symbol text
indices The character positions the symbol was extracted from
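Each entity type above carries an `indices` attribute. A minimal sketch of how those positions might be used to recover the entity from the tweet text follows. The tweet and its entities are invented, `entity_spans` is a hypothetical helper, and the sketch assumes that `indices` is a `[start, end)` pair counted in characters, with the leading `#`, `@`, or `$` included.

```python
# Invented tweet text and entities; only the attribute names (text, indices,
# screen_name) come from the slides.
tweet_text = "Notes on #metadata by @libuser $ABC"

entities = {
    "hashtags": [{"text": "metadata", "indices": [9, 18]}],
    "user_mentions": [{"screen_name": "libuser", "indices": [22, 30]}],
    "symbols": [{"text": "ABC", "indices": [31, 35]}],
}


def entity_spans(text, entities):
    """Return (kind, substring) pairs by slicing the text at each entity's indices.

    Assumes indices are [start, end) character offsets into the text.
    """
    spans = []
    for kind, items in entities.items():
        for item in items:
            start, end = item["indices"]
            spans.append((kind, text[start:end]))
    return spans


for kind, span in entity_spans(tweet_text, entities):
    print(kind, span)
```

A client could use the same offsets to replace each span with a link when displaying the tweet.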
29. Photo metadata: a simple concept
There are 3 main categories of data:
• Administrative – identification of the creator, creation
date and location, contact information for licensors of
the image, and other technical details.
• Descriptive – information about the visual content.
This may include headline, title, captions and
keywords. This can be done using free text or codes
from a controlled vocabulary.
• Rights – copyright information and underlying rights in
the visual content including model and property rights,
and rights usage terms.
30. Classes Of Metadata
• Technical Metadata
1. Most modern image-capture devices generate
information about themselves and the pictures
they record, such as that stored in Exif.
2. These data describe an image’s technical
characteristics, such as its size, color profile, ISO
speed and other camera settings.
3. Some professional cameras can be configured to
add detailed ownership and descriptive
information.
31. Descriptive Metadata
A photographer or image collection manager can enter and embed various
information about an image’s contents.
This can include
1. captions,
2. headlines,
3. titles,
4. keywords,
5. location of capture, etc.
These metadata fields were included in the original IPTC-IIM schema
and expanded in the IPTC Core and IPTC Extension metadata schemas.
Good descriptive metadata are key to unlocking an image collection to find stored
images.
32. Administrative Metadata
Image files can also include
1. licensing or rights usage terms,
2. Specific restrictions on using an image,
3. Model releases,
4. Source information, such as the identity of the creator,
and contact information for the rights holder or
licensor.
These types of metadata have been comprehensively
addressed and standardized within the PLUS system.
34. Weather Metadata
Observing site name: Aviemore
Location (deg lat, deg long): 57N, 3W
Elevation: 226m
Parameter observed: temperature
Operator: Met Office
Started: 11:50 01/01/01
Ended: 12:00 01/01/01
Value: 4
Units: Celsius
Instrument
instrument number: 123456
instrument inspection date: 01/01/01
instrument type: Magic mercury 1234
(Figure legend: Data, Metadata)
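The weather observation above can be written as a small record that keeps the measured value apart from everything that describes it. Treating only the value as data and all other fields as metadata is this sketch's reading of the slide; the key names and the helper `describe` are hypothetical.

```python
# The observation from the slide, split into data and metadata.
# The split itself and all key names are assumptions of this sketch.
observation = {
    "data": {"value": 4},
    "metadata": {
        "site_name": "Aviemore",
        "location": {"lat": "57N", "long": "3W"},
        "elevation_m": 226,
        "parameter": "temperature",
        "operator": "Met Office",
        "started": "11:50 01/01/01",
        "ended": "12:00 01/01/01",
        "units": "Celsius",
        "instrument": {
            "number": "123456",
            "inspection_date": "01/01/01",
            "type": "Magic mercury 1234",
        },
    },
}


def describe(obs):
    """Render the bare value together with the metadata needed to interpret it."""
    m = obs["metadata"]
    return (
        f'{obs["data"]["value"]} {m["units"]} ({m["parameter"]}) '
        f'at {m["site_name"]}, {m["started"]} to {m["ended"]}'
    )


print(describe(observation))
```

Without the metadata, the value 4 could not be interpreted: the units, the parameter, the site, and the time window all come from the descriptive fields.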
35. 1000 Genomes
• Example 1000 Genomes Data
• CHROM 4
• POS 42208061
• ID rs186575857
• REF T
• ALT C
• QUAL 100
• FILTER PASS
• INFO AA=T;AN=2184;AC=1;RSQ=0.8138;AF=0.0005;
• FORMAT GT:DS:GL
• GENOTYPE 0|0:0.000:-0.03,-1.19,-5.00
40. Metadata is key to ensuring that
resources will survive and continue to be
accessible into the future
41. • Metadata is structured information that
– describes,
– explains,
– locates, or
– otherwise makes it easier to retrieve, use, or manage
an information resource.
• Metadata is often called
– data about data or
– information about information.
42. Usage of the Term Metadata
• The term is used differently in different communities.
• Some use it to refer to machine-understandable
information, while others use it only for records that
describe electronic resources.
• In the library environment, metadata is commonly used
for any formal scheme of resource description, applying
to any type of object (digital or non-digital).
• Traditional library cataloging is a form of metadata.
43. MARC 21 & AACR
• MARC 21 and the rule sets used with it, such as AACR2,
are metadata standards.
• MARC 21 (Machine-Readable Cataloging) Format for
Bibliographic Data: 1999 Edition, Update No. 1 (October 2000)
through Update No. 21 (September 2015), Library of Congress
• AACR (Anglo-American Cataloguing Rules) and its allied
products are published jointly by the American Library
Association, the Canadian Library Association, and the
Chartered Institute of Library and Information
Professionals.
44. • Metadata schemes have also been developed to
describe various types of textual and non-textual
objects including
– Published books
– electronic documents
– Archival finding aids
– Art objects
– Educational and training materials and
– Scientific datasets
45. Metadata Types
There are three main types of metadata:
• Descriptive metadata - describes a resource for
purposes such as discovery and identification.
• It can include elements such as
– title, abstract, author, and keywords.
• Structural metadata - indicates how compound
objects are put together, for example,
– How pages are ordered to form chapters.
• Administrative metadata – provides information to
help manage a resource, such as
– when and how it was created, file type and other technical
information, and who can access it.
46. Metadata Types
• There are several subsets of administrative
metadata; two that sometimes are listed as
separate metadata types are:
– Rights management metadata, which deals with
intellectual property rights, and
– Preservation metadata, which contains
information needed to archive and preserve a
resource
47. Aggregation
• Metadata can describe resources at any level
of aggregation. It can describe
– a collection,
– a single resource, or
– a component part of a larger resource (for
example, a photograph in an article).
• Just as catalogers make decisions about
whether a catalog record should be created
for a whole set of volumes or for each
particular volume in the set, the metadata
creator makes similar decisions
48. Storing Metadata
• Metadata can be embedded in a digital object
or it can be stored separately.
• Metadata is often embedded in HTML
documents and in the headers of image files.
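As a rough illustration of metadata embedded in an HTML document, the sketch below pulls the name/content pairs out of `<meta>` tags using Python's standard `html.parser` module. The sample page is invented, and `MetaExtractor` is a hypothetical class name.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "meta":
            attr = dict(attrs)
            if "name" in attr and "content" in attr:
                self.meta[attr["name"]] = attr["content"]


# Invented sample page with two embedded metadata elements.
SAMPLE_HTML = """<html><head>
<title>Sample page</title>
<meta name="description" content="An overview of metadata conventions.">
<meta name="keywords" content="metadata, publishing">
</head><body></body></html>"""

parser = MetaExtractor()
parser.feed(SAMPLE_HTML)
embedded = parser.meta
print(embedded)
```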
49. Storing Metadata
• Storing metadata with the object
– Ensures the metadata will not be lost,
– obviates problems of linking between data and metadata,
– and helps ensure that the metadata and object will be
updated together
• Storing metadata separately
– can simplify the management of the metadata itself and
– facilitate search and retrieval
• Therefore, metadata is commonly stored in a database
system and linked to the objects described
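A minimal sketch of the separate-storage approach, using Python's built-in `sqlite3` module: one table holds the objects' locations, another holds their metadata, and the two are linked by an identifier. The table layout, column names, and sample rows are all assumptions made for illustration.

```python
import sqlite3

# In-memory database; schema and sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (object_id INTEGER PRIMARY KEY, location TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE metadata (
           object_id INTEGER NOT NULL REFERENCES objects(object_id),
           element   TEXT NOT NULL,
           value     TEXT NOT NULL)"""
)
conn.execute("INSERT INTO objects VALUES (1, 'files/report.pdf')")
conn.execute("INSERT INTO objects VALUES (2, 'files/photo.jpg')")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        (1, "title", "Annual Report"),
        (1, "format", "application/pdf"),
        (2, "title", "Harbour at Dusk"),
        (2, "format", "image/jpeg"),
    ],
)


def find_locations(conn, element, value):
    """Search the metadata table and follow the link back to the object."""
    rows = conn.execute(
        "SELECT o.location FROM objects o "
        "JOIN metadata m ON m.object_id = o.object_id "
        "WHERE m.element = ? AND m.value = ?",
        (element, value),
    )
    return [row[0] for row in rows]


print(find_locations(conn, "format", "application/pdf"))
```

The search runs entirely against the metadata table; the object itself is only touched once its location has been found, which is the retrieval benefit the slide describes.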
50. What Does Metadata Do?
• An important reason for creating descriptive
metadata is to facilitate discovery of relevant
information
• In addition to resource discovery, metadata can
– help organize electronic resources,
– facilitate interoperability and legacy resource
integration,
– provide digital identification, and
– support archiving and preservation
51. Resource Discovery
• Metadata serves the same functions in resource
discovery as good cataloging does by:
• Allowing resources to be found by relevant criteria
• Identifying resources
• Bringing similar resources together
• Distinguishing dissimilar resources and
• Giving location information
52. Organizing Electronic Resources
• As the number of Web-based resources grows
exponentially, aggregate sites or portals are increasingly
useful in organizing links to resources based on audience or
topic.
• Such lists can be built as static webpages, with the names
and locations of the resources “hardcoded” in the HTML.
• However, it is more efficient and increasingly more
common to build these pages dynamically from metadata
stored in databases.
• Various software tools can be used to automatically extract
and reformat the information for Web applications.
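The difference between a hardcoded list and a dynamically built one can be sketched as follows. A plain Python list stands in for the database of metadata records; the records, field names, and the helper `render_links` are invented for illustration.

```python
from html import escape

# Stand-in for metadata records fetched from a database (invented values).
records = [
    {"title": "Style Guide", "url": "https://example.org/style", "audience": "students"},
    {"title": "Cataloging Rules", "url": "https://example.org/rules", "audience": "librarians"},
    {"title": "Citing Sources", "url": "https://example.org/cite", "audience": "students"},
]


def render_links(records, audience):
    """Build an HTML link list for one audience from the metadata records."""
    items = [
        f'<li><a href="{escape(r["url"], quote=True)}">{escape(r["title"])}</a></li>'
        for r in sorted(records, key=lambda r: r["title"])
        if r["audience"] == audience
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


html_out = render_links(records, "students")
print(html_out)
```

Adding a resource then means adding one metadata record rather than editing every page that should list it.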
53. Interoperability
• Describing a resource with metadata allows it to be
understood by both humans and machines in ways
that promote interoperability.
• Interoperability is the ability of multiple systems with
different hardware and software platforms, data
structures, and interfaces to exchange data with
minimal loss of content and functionality.
• Using defined metadata schemes, shared transfer
protocols, and crosswalks (mapping) between
schemes, resources across the network can be
searched more seamlessly.
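A crosswalk can be thought of as a mapping table from the elements of one scheme to those of another. The sketch below maps an invented local scheme onto Dublin Core element names; the local field names, the mapping, and the helper `crosswalk` are hypothetical. Fields with no equivalent are reported rather than silently dropped, since they are where content is lost in an exchange.

```python
# Hypothetical local field names mapped to Dublin Core element names.
LOCAL_TO_DC = {
    "headline": "title",
    "byline": "creator",
    "tags": "subject",
    "summary": "description",
    "pub_date": "date",
}


def crosswalk(record, mapping):
    """Translate a record into the target scheme; list fields that have no mapping."""
    converted, unmapped = {}, []
    for field, value in record.items():
        target = mapping.get(field)
        if target is None:
            unmapped.append(field)
        else:
            converted[target] = value
    return converted, unmapped


local_record = {
    "headline": "Metadata Demystified",
    "byline": "Brand, Amy",
    "pub_date": "2003-07",
    "shelf_code": "A-17",  # local-only field with no Dublin Core equivalent here
}

dc_record, lost = crosswalk(local_record, LOCAL_TO_DC)
print(dc_record)
print(lost)
```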
54. Digital Identification
• Most metadata schemes include elements such as standard
numbers to uniquely identify the work or object to which the
metadata refers.
– The location of a digital object may also be given using a file name, a URL
(Uniform Resource Locator), or some more persistent identifier such as a
PURL (Persistent URL) or DOI (Digital Object Identifier).
• Persistent identifiers are preferred because object locations
often change, making the standard URL (and therefore the
metadata record) invalid.
• In addition to the actual elements that point to the object, the
metadata can be combined to act as a set of identifying data,
differentiating one object from another for validation purposes.
55. Archiving and Preservation
• Most current metadata efforts center around
the discovery of recently created resources.
• However, there is a growing concern that
digital resources will not survive in usable
form into the future.
– Digital information is fragile
– It can be corrupted or altered, intentionally or
unintentionally.
– It may become unusable as storage media and
hardware and software technologies change.
56. Format Migration and Emulation
• Format migration and perhaps emulation of current hardware
and software behavior in future hardware and software
platforms are strategies for overcoming these challenges.
• Metadata is key to ensuring that resources will survive
and continue to be accessible into the future.
• Archiving and preservation require special elements
– to track the lineage of a digital object (where it came from
and how it has changed over time),
– to detail its physical characteristics, and
– to document its behavior in order to emulate it on future
technologies.
57. Structuring Metadata
• Metadata schemes (also called schema) are sets of metadata
elements designed for a specific purpose, such as
– describing a particular type of information resource.
• The definition or meaning of the elements themselves is known as
the semantics of the scheme
• The values given to metadata elements are the content
• Metadata schemes generally specify names of elements and their
semantics
• Optionally, they may specify
– content rules for how content must be formulated, for example,
how to identify the main title,
– representation rules for content, for example, capitalization rules,
and
– allowable content values, for example, terms must be used from a
specified controlled vocabulary.
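The parts of a scheme described above (element names, required elements, allowable values) can be sketched as a small validator. The scheme, its three elements, and the vocabularies are invented for illustration and do not reproduce any published standard.

```python
# Hypothetical scheme: element names, whether each is required, and an
# optional controlled vocabulary of allowable values.
SCHEME = {
    "title": {"required": True},
    "type": {"required": True, "allowed": {"Text", "Image", "Sound"}},
    "language": {"required": False, "allowed": {"en", "fr", "de"}},
}


def validate(record, scheme):
    """Return a list of problems found when checking a record against a scheme."""
    problems = []
    for element, rules in scheme.items():
        if element not in record:
            if rules.get("required"):
                problems.append(f"missing required element: {element}")
            continue
        allowed = rules.get("allowed")
        if allowed is not None and record[element] not in allowed:
            problems.append(f"value {record[element]!r} not allowed for {element}")
    for element in record:
        if element not in scheme:
            problems.append(f"unknown element: {element}")
    return problems


good = {"title": "Metadata Demystified", "type": "Text", "language": "en"}
bad = {"type": "Book"}
print(validate(good, SCHEME))
print(validate(bad, SCHEME))
```

Here the element names and their meaning correspond to the semantics of the scheme, the values in each record are the content, and the `allowed` sets play the role of a controlled vocabulary.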
58. Structuring Metadata
• There may also be syntax rules for how the elements and
their content should be encoded.
• A metadata scheme with no prescribed syntax rules is
called syntax independent.
• Metadata can be encoded in any definable syntax.
• Many current metadata schemes use
– SGML (Standard Generalized Mark-up Language) or
– XML (Extensible Mark-up Language).
• XML, developed by the World Wide Web Consortium
(W3C), is a simplified subset of SGML that, unlike HTML
with its fixed tag set, allows for locally defined tag sets
and the easy exchange of structured information.
• SGML is the parent of both: HTML was defined as an SGML
application and XML as a restricted form of SGML. SGML
allows for the richest mark-up of a document.
59. Metadata Schemes and
Element Sets
• Many different metadata schemes are being developed in a
variety of user environments and disciplines.
• Some of the most common ones are
– Dublin Core Metadata Initiative (DCMI)
– The Text Encoding Initiative (TEI)
– Metadata Encoding and Transmission Standard (METS)
– Metadata Object Description Schema (MODS)
– Learning Object Metadata
– E-Commerce – <indecs> and ONIX
– Visual Objects – CDWA and VRA
– MPEG Multimedia Metadata
– Metadata schemes for datasets
60. Dublin Core
• The Dublin Core Metadata Element Set arose from
discussions at a 1995 workshop sponsored by OCLC
and the National Center for Supercomputing
Applications (NCSA).
• As the workshop was held in Dublin, Ohio, the element
set was named the Dublin Core.
• The continuing development of the Dublin Core and
related specifications is managed by the Dublin Core
Metadata Initiative (DCMI).
• The original objective of the Dublin Core was to define
a set of elements that could be used by authors to
describe their own Web resources.
61. Dublin Core Example
Title="Metadata Demystified"
Creator="Brand, Amy"
Creator="Daly, Frank"
Creator="Meyers, Barbara"
Subject="metadata"
Description="Presents an overview of metadata conventions in publishing."
Publisher="NISO Press"
Publisher="The Sheridan Press"
Date="2003-07"
Type="Text"
Format="application/pdf"
Identifier="http://www.niso.org/standards/resources/Metadata_Demystified.pdf"
Language="en"
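Because Dublin Core is syntax independent, the record above can be encoded in more than one way. One possible XML encoding is sketched below with Python's standard `xml.etree.ElementTree`. The `dc` prefix is bound to the Dublin Core elements namespace; the wrapper element name `metadata` and the helper `to_xml` are assumptions of this sketch, not part of the standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The record from the slide as (element, value) pairs; elements may repeat.
RECORD = [
    ("title", "Metadata Demystified"),
    ("creator", "Brand, Amy"),
    ("creator", "Daly, Frank"),
    ("creator", "Meyers, Barbara"),
    ("subject", "metadata"),
    ("description", "Presents an overview of metadata conventions in publishing."),
    ("publisher", "NISO Press"),
    ("publisher", "The Sheridan Press"),
    ("date", "2003-07"),
    ("type", "Text"),
    ("format", "application/pdf"),
    ("identifier", "http://www.niso.org/standards/resources/Metadata_Demystified.pdf"),
    ("language", "en"),
]


def to_xml(pairs):
    """Serialize (element, value) pairs as namespaced Dublin Core XML."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for element, value in pairs:
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml(RECORD)
print(xml_text)
```

A list of pairs is used instead of a dictionary because Dublin Core elements are repeatable, as the three Creator entries show.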
62. The Text Encoding Initiative (TEI)
• The Text Encoding Initiative is an international project to develop
guidelines for marking up electronic texts such as novels, plays, and
poetry, primarily to support research in the humanities. In addition
to specifying how to encode the text of a work, the TEI Guidelines
for Electronic Text Encoding and Interchange also specify a header
portion, embedded in the resource, that consists of metadata about
the work. The TEI header, like the rest of the TEI, is defined as an
SGML DTD (Document Type Definition)— a set of tags and rules
defined in SGML syntax that describe the structure and elements of
a document.
• This SGML mark-up becomes part of the electronic resource itself.
Since the TEI DTD is rather large and complicated in order to apply
to a vast range of texts and uses, a simpler subset of the DTD,
known as TEI Lite, is commonly used in libraries.
• It is assumed that TEI-encoded texts are electronic versions of
printed texts.
63. Metadata Encoding and
Transmission Standard (METS)
• The Metadata Encoding and Transmission
Standard (METS) was developed to fill the need
for a standard data structure for describing
complex digital library objects.
• METS is an XML Schema for creating XML
document instances that express the structure of
digital library objects, the associated descriptive
and administrative metadata, and the names and
locations of the files that comprise the digital
object.
64. Metadata Object Description Schema (MODS)
• The Metadata Object Description Schema (MODS) is a descriptive
metadata schema that is a derivative of MARC 21 and intended to
either carry selected data from existing MARC 21 records or enable
the creation of original resource description records.
• It includes a subset of MARC fields and uses language-based tags
rather than the numeric ones used in MARC 21 records.
• In some cases, it regroups elements from the MARC 21
bibliographic format.
• Like METS, MODS is expressed using the XML schema language.
• Although the MODS standard can stand on its own, it may also
complement other metadata formats.
Note on the 1000 Genomes example (slide 35): this is an example of SNP data in the 20110521 release. The INFO field contains information such as the ancestral allele, the allele count and allele number, and the allele frequency. The genotype field always begins with the individual genotype, which is an index into an array of the reference and alternative alleles. Normally there are only 0 and 1, but if the variant is multi-allelic there will be higher indexes too. The pipe symbol indicates a phased genotype; unphased genotypes are delimited by /. The other fields in the genotype column are generally measures of genotype quality. In this instance the second field is a dosage measure from Mach/Thunder and the third field is a genotype likelihood giving a log likelihood for each of the 3 possible genotypes RR, RA, and AA.
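The description above can be followed step by step in a short sketch that parses the record from the 1000 Genomes slide. The field values are copied from that slide; the helper names `parse_info` and `parse_genotype` are hypothetical, and the sketch does not handle missing genotypes or other cases outside this one record.

```python
# The record from the 1000 Genomes slide, keyed by column name.
VCF_FIELDS = {
    "CHROM": "4", "POS": "42208061", "ID": "rs186575857",
    "REF": "T", "ALT": "C", "QUAL": "100", "FILTER": "PASS",
    "INFO": "AA=T;AN=2184;AC=1;RSQ=0.8138;AF=0.0005;",
    "FORMAT": "GT:DS:GL",
    "GENOTYPE": "0|0:0.000:-0.03,-1.19,-5.00",
}


def parse_info(info):
    """Split the INFO column into key/value pairs (flags without '=' map to '')."""
    result = {}
    for item in info.split(";"):
        if item:
            key, _, value = item.partition("=")
            result[key] = value
    return result


def parse_genotype(fmt, sample, ref, alt):
    """Decode one genotype column using the FORMAT keys GT, DS, and GL."""
    values = dict(zip(fmt.split(":"), sample.split(":")))
    phased = "|" in values["GT"]            # '|' is phased, '/' is unphased
    separator = "|" if phased else "/"
    alleles = [ref] + alt.split(",")        # index 0 is REF, 1 and up are ALT
    called = [alleles[int(i)] for i in values["GT"].split(separator)]
    return {
        "phased": phased,
        "alleles": called,
        "dosage": float(values["DS"]),
        "likelihoods": [float(x) for x in values["GL"].split(",")],
    }


info = parse_info(VCF_FIELDS["INFO"])
genotype = parse_genotype(
    VCF_FIELDS["FORMAT"], VCF_FIELDS["GENOTYPE"],
    VCF_FIELDS["REF"], VCF_FIELDS["ALT"],
)
print(info)
print(genotype)
```

For this record the genotype `0|0` decodes to two copies of the reference allele T, and the three likelihood values correspond to the genotypes RR, RA, and AA in that order.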