With library collections now predominantly electronic, there is more and more reliance on ‘knowledgebases’, those databases of metadata about e-resources that are provided by suppliers of e-resource management software (ERM), as well as by community organisations such as Jisc. This panel, made up of an e-book supplier, a metadata librarian and a discovery service repository manager, will provide the audience with a view of what it takes to actually get metadata from the supplier of the e-resource through the ingest and editorial processes of the knowledgebase provider and into the discovery service.
How Metadata Gets Into a Knowledgebase
1. In and out: how does that metadata get into a knowledgebase anyhow?
Heather Sherman
Head of Library Programme Management – Dawson Books
2. Creation process
• Sign contract with publisher
• Acquire content and basic metadata
• Correct metadata errors
• Enhance basic metadata
• Create ProQuest xml feed
• Create TOC data
3. Sign contract with publisher
• The process starts with a publisher agreeing to host their titles on dawsonera.
• Publishers are asked to send Dawson the ebook content, jacket image and associated metadata.
• Some send this in xml. Others complete a spreadsheet.
4. Publisher sends files of metadata
Publishers supply key pieces of metadata:
• eISBN
• Title
• Subtitle
• Author(s)
• Price
• Currency
• PDF file name
• Jacket image
• Publisher
• Imprint
• Publication date
• Edition
• Country of publication
• Usage model
6. Publisher sends files of metadata
However…
• Not all publishers supply the key data, so we have to go and find it.
• Some supply incorrect data, so we have to fix that.
• Dawson's automated import process checks that the key data is present and correct, and reports any errors.
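The slides do not show what those import checks look like. Purely as an illustration, here is a minimal Python sketch (with hypothetical field names, not Dawson's actual import code) of the kind of presence and ISBN check-digit validation such an automated import might run:

# Illustrative only: field names and rules are assumptions.
REQUIRED_FIELDS = ["eisbn", "title", "authors", "publisher", "publication_date", "pdf_file_name"]

def isbn13_check_digit_ok(isbn):
    """True if the 13 digits satisfy the ISBN-13 checksum (alternating weights 1 and 3)."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0

def report_errors(record):
    """Collect problems for one publisher-supplied record rather than silently loading it."""
    errors = ["missing field: " + field for field in REQUIRED_FIELDS if not record.get(field)]
    if record.get("eisbn") and not isbn13_check_digit_ok(record["eisbn"]):
        errors.append("eisbn fails the ISBN-13 check digit")
    return errors

print(report_errors({"eisbn": "9780191015021", "title": "The Global Revolution"}))
# -> ['missing field: authors', 'missing field: publisher', ...]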
8. Table of contents data created
• PDF files are sent to an agency that creates Table of Contents (TOC) data.
• For ePub files, the TOC is extracted directly from the file.
• TOC data is imported into the Dawson system and matched up with the PDFs and metadata.
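How exactly the returned TOC files are matched up is not described. One plausible, purely hypothetical approach, sketched below, is to key both the TOC file and the metadata record on the eISBN, the same convention the jacket-image file name in the later XML example appears to follow:

from pathlib import Path

def match_toc_files(toc_dir, records_by_eisbn):
    """Attach each TOC file to its metadata record, assuming files are named <eisbn13>.xml."""
    unmatched = []
    for toc_file in Path(toc_dir).glob("*.xml"):
        eisbn = toc_file.stem  # e.g. "9780191015021"
        record = records_by_eisbn.get(eisbn)
        if record is None:
            unmatched.append(toc_file.name)  # report rather than drop silently
        else:
            record["toc_path"] = str(toc_file)
    return {"unmatched_toc_files": unmatched}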
9. TOC xml
10. Metadata enhanced
• Publisher metadata and TOC data are matched to existing print records in the Dawson title database.
• A hybrid record is created, incorporating data from the publisher and Dawson.
• This produces a record containing as much information as Dawson holds about the title.
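As a rough sketch only (the real matching runs against Dawson's title database), a hybrid record could be built by starting from the matched print record and letting non-empty publisher-supplied e-book fields fill in or override where present:

def build_hybrid_record(print_record, publisher_record, toc=None):
    """Merge the matched print record with the publisher's e-book metadata (illustrative).

    The print record supplies descriptive fields Dawson already holds; non-empty
    publisher values (eISBN, price, usage model, ...) are layered on top.
    """
    hybrid = dict(print_record)
    hybrid.update({k: v for k, v in publisher_record.items() if v not in (None, "")})
    if toc:
        hybrid["toc"] = toc
    return hybrid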
11. Dawson ebook MARC record
=LDR 01354nam 2200349 4500
=001 DAW28874972
=007 cr
=008 140327s2014enkfs001|0|eng|d
=020 $a0191015024 (e-book)
=020 $a9780191015021 (e-book)
=040 $aStDuBDS$cStDuBDS$erda$dDAWSON
=041 1$aeng$hita
=082 04$a320.53209$223
=100 1$aPons, Silvio,$eauthor.
=245 14$aThe global revolution$h[electronic resource] : $ba history of international communism, 1917-1991 / $cSilvio Pons ;
translated by Allan Cameron.
=264 1$aOxford :$bOxford University Press,$c2014.
=300 $axx, 365 pages
=336 $atext$2rdacontent
=337 $acomputer$2rdamedia
=338 $aonline resource$2rdacarrier
=490 1$aOxford studies in modern European history
=500 $aTranslated from the Italian.
=504 $aIncludes bibliographical references and index.
=530 $aAlso available in printed form.
=533 $aElectronic reproduction.$cDawson Books.$nMode of access: World Wide Web.
=650 0$aCommunism$xHistory.
=650 0$aCommunism.
=655 7$aElectronic books.$2lcsh
=700 1$aCameron, Allan,$d1952-$etranslator.
=776 0$cHardback$z9780199657629
=830 0$aOxford studies in modern European history.
12. ProQuest feed created
• The hybrid record is extracted and turned into an xml record.
• Dawson sends daily files of new titles and updated data to ProQuest.
• A weekly file of data for all titles is also sent.
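The next slide shows the XML itself. As a hedged illustration of the extraction step, here is a sketch that serialises a hybrid record into that <document> shape using Python's standard library; the element names are copied from the example, everything else is assumed:

import xml.etree.ElementTree as ET

def hybrid_to_proquest_xml(record):
    """Serialise one hybrid record into a dawsonera-style <document> element (illustrative)."""
    doc = ET.Element("document", {"jacket": record["eisbn13"] + ".jpg",
                                  "lang": record.get("lang", "eng")})
    eisbn = ET.SubElement(doc, "eisbn")
    ET.SubElement(eisbn, "eisbn13").text = record["eisbn13"]
    ET.SubElement(eisbn, "eisbn10").text = record.get("eisbn10", "")
    titles = ET.SubElement(doc, "title-group")
    ET.SubElement(titles, "title").text = record["title"]
    if record.get("subtitle"):
        ET.SubElement(titles, "subtitle").text = record["subtitle"]
    return ET.tostring(doc, encoding="unicode")

# The daily feed would write out only records created or changed since the last run;
# the weekly feed would simply iterate over every title.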
13. xml data sent to ProQuest
<document initial-page="4" jacket="9780191015021.jpg" lang="eng">
<eisbn>
<eisbn13>9780191015021</eisbn13>
<eisbn10>0191015024</eisbn10>
</eisbn>
<isbn-group>
<isbn10 type="hb">0199657629</isbn10>
<isbn13 type="hb">9780199657629</isbn13>
</isbn-group>
<title-group>
<title>The Global Revolution: A History of International Communism 1917-1991</title>
<subtitle>A History of International Communism 1917-1991</subtitle>
</title-group>
<author-group>
<author>
<person-name>Silvio Pons ; Translated By Allan Cameron.</person-name>
</author>
</author-group>
14. IN AND OUT: HOW DOES THAT METADATA GET INTO A KNOWLEDGEBASE ANYHOW?
Ben Johnson
Lead Metadata Librarian, KB Provider Data
Benjamin.Johnson@proquest.com
Acquisition and Ingestion of Provider Data into a Knowledgebase (KB)
17. Introduction
Acquire
• Get the data
• Verify compatibility
• Map the data
Ingest
• Transform the data
• Load
• Review
• Accept/Reject
Correct
• Customer inquiries
• Content integrity
• Product interoperability
… Profit!
18. Providers we partner with
• Publishers
• Content aggregators (PQ, Gale)
• University and library local content
• Library consortia (JISC, BIBSAM)
20. KBART
• Joint NISO/UKSG Group
• Librarians, Vendors, Providers
• Transmission of metadata to vendors
• Human and machine readable data
• http://www.niso.org/workrooms/kbart
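The recommended practice defines the full set of tab-delimited title-list fields. As a small sketch (not from the slides), writing a few KBART-style columns for the Dawsonera title used elsewhere in this presentation; only a subset of fields is shown and the URL is a placeholder:

import csv, sys

# A subset of KBART title-list fields; see the recommended practice for the full list.
FIELDS = ["publication_title", "print_identifier", "online_identifier",
          "title_url", "first_author", "publisher_name", "publication_type"]

row = {
    "publication_title": "The Global Revolution",
    "print_identifier": "9780199657629",
    "online_identifier": "9780191015021",
    "title_url": "https://www.dawsonera.com/",  # placeholder, not a real title-level link
    "first_author": "Pons",
    "publisher_name": "Oxford University Press",
    "publication_type": "monograph",
}

writer = csv.DictWriter(sys.stdout, fieldnames=FIELDS, delimiter="\t")
writer.writeheader()
writer.writerow(row)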
21. Ingestion – mapping and transformation
Acquire the data
• FTP, HTML
• CSV/Text, Excel, XML, HTML
Create packages
• Data for existing content is mapped to KB packages (new T&F package, JISC/BIBSAM new license)
Transform the content
• Map the content to our schema
• Normalize the data (dates, diacritics)
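As a hedged sketch of that transform step (the field formats and the list of date patterns are assumptions, not the actual KB tooling), normalising dates to ISO 8601 and composing diacritics consistently before the data is compared with what is already in the KB:

import unicodedata
from datetime import datetime

# The formats actually seen vary by provider; these are examples only.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y%m%d", "%B %Y"]

def normalize_date(value):
    """Return an ISO 8601 date if the value matches a known provider format, else the original."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value

def normalize_text(value):
    """Compose diacritics consistently (Unicode NFC) and collapse stray whitespace."""
    return " ".join(unicodedata.normalize("NFC", value).split())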
28. IN AND OUT: HOW DOES THAT METADATA GET INTO A KNOWLEDGEBASE ANYHOW?
Dave Hovenden – Content Operations Manager, Summon
ProQuest
UKSG Conference – 30 March – 1 April, 2015
30. The Content Ingestion Process at Summon for Commercial Content
Identify New Content to Add into Summon
31. Identifying New Commercial Content to Add into Summon
• Product Management, Sales, and our Global Content Alliance work together to identify new content to add into Summon
• New content requests from Summon customers are also considered
• Publishers and content providers may also request to have their content added into Summon
32. The Content Ingestion Process at Summon for Commercial Content
Identify New Content to Add into Summon → Engage with Publisher/Provider → Pre-Agreement Content Sample Analysis
33. Pre-Agreement Content Sample Analysis
• The sample analysis is used to help determine the quality and extent of the metadata and the metadata schema
• We also try to determine things such as linking methods, how rights are assigned to the content, and what databases we may need in our knowledgebase (if they don't already exist)
• Summon often indexes content at the article level or chapter level, as that is usually the level of granularity at which the content is supplied
34. What Metadata Do We Look For During Sample Analysis?
Title Metadata
• Article titles, Book titles, Publication titles, Subtitles, etc.
Identifier Metadata
• Unique IDs for specific articles, chapters, etc.
• Publication-level unique identifiers such as ISSN or ISBN
• Additional identifiers such as OCLC Number, LCCN, Dewey, DOI, etc.
Publication Information Metadata
• Publisher, Author(s), Corporate Authors, Volume Numbers, Issue Numbers, Start Page, Publication Date, Publication Series, etc.
Additional Metadata
• Subject Headings, Keywords, Language
35. Dawsonera Book Example – The Global Revolution: A History of International Communism 1917-1991 (ISBN-13 – 9780199657629)
<document initial-page="4" jacket="9780191015021.jpg" lang="eng">
<eisbn>
<eisbn13>9780191015021</eisbn13>
<eisbn10>0191015024</eisbn10>
</eisbn>
<territory-group/>
<parent-isbn/>
<isbn-group>
<isbn10 type="hb">0199657629</isbn10>
<isbn13 type="hb">9780199657629</isbn13>
</isbn-group>
<title-group>
<title>The Global Revolution: A History of International Communism 1917-1991</title>
<subtitle>A History of International Communism 1917-1991</subtitle>
</title-group>
<author-group>
<author>
<person-name>Silvio Pons ; Translated By Allan Cameron.</person-name>
</author>
</author-group>
<endnote-authors>
<endnote-author>Pons, Silvio,</endnote-author>
<endnote-author>Cameron, Allan,</endnote-author>
</endnote-authors>
37. Summon and the Knowledgebase
• Summon relies upon the knowledgebase to help facilitate rights access to the content
• Rights access is assigned by tracking a particular title by ISSN or ISBN in the knowledgebase, or by Database ID
• The knowledgebase also helps Summon indicate when content has full-text availability
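A minimal sketch of that idea (the data structures are invented for illustration, not Summon's internals): full-text rights for an indexed item are switched on only if the title's ISSN/ISBN, or the database it belongs to, appears in the library's knowledgebase holdings:

def has_fulltext_rights(item, holdings):
    """True if the library's KB profile grants access to this indexed item.

    `holdings` is assumed to hold the identifiers tracked in the KB:
    a set of subscribed database IDs and a set of ISSNs/ISBNs.
    """
    if item.get("database_id") in holdings.get("database_ids", set()):
        return True
    identifiers = {item.get("issn"), item.get("isbn")} - {None}
    return bool(identifiers & holdings.get("identifiers", set()))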
38. The Content Ingestion Process at Summon for Commercial Content
Identify New Content to Add into Summon → Engage with Publisher/Provider → Pre-Agreement Content Sample Analysis → Formalize and Sign Data Sharing Agreement → Data is Delivered in Full from Publisher/Provider → Data Normalization, Mapping, and Enrichment
39. Data Normalization, Mapping, and Enrichment Work
Data Normalization
• Very basic high-level clean-up of the data to standardize it
• Examples include: removing leading/trailing white spaces in Title and Subtitle fields; cleaning up diacritics and other encoding issues
Mapping
• Map the metadata fields in the records to the Summon schema
• This allows the metadata to appear in the UI and/or be made searchable within Summon
Enrichment
• Enriching the content by adding additional metadata when applicable
• Examples: scholarly/peer-reviewed flags from Ulrich's; citation counts from Scopus; book cover images from Syndetics
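A rough sketch of what those three steps could look like in code; the target field names are invented for illustration (not the real Summon schema) and the enrichment lookups are stubs standing in for Ulrich's and Syndetics:

FIELD_MAP = {  # provider field -> hypothetical discovery-schema field
    "title": "Title",
    "subtitle": "Subtitle",
    "eisbn13": "ISBN",
    "person-name": "Author",
}

def normalize(record):
    """Trim leading/trailing whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def map_to_schema(record):
    """Carry across only the fields the discovery schema knows about."""
    return {target: record[source] for source, target in FIELD_MAP.items() if source in record}

def enrich(doc, peer_review_lookup, cover_lookup):
    """Add externally sourced metadata when available (stand-ins for Ulrich's/Syndetics)."""
    doc["IsPeerReviewed"] = peer_review_lookup(doc.get("ISBN"))
    cover = cover_lookup(doc.get("ISBN"))
    if cover:
        doc["CoverImageURL"] = cover
    return doc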
40. The Content Ingestion Process at Summon for Commercial Content
Identify New Content to Add into Summon → Engage with Publisher/Provider → Pre-Agreement Content Sample Analysis → Formalize and Sign Data Sharing Agreement → Data is Delivered in Full from Publisher/Provider → Data Normalization, Mapping, and Enrichment → Indexing
41. The Title as it Appears in Summon Once Indexed
42. The Content Ingestion Process at Summon for Commercial Content
Identify New Content to Add into Summon → Engage with Publisher/Provider → Pre-Agreement Content Sample Analysis → Formalize and Sign Data Sharing Agreement → Data is Delivered in Full from Publisher/Provider → Data Normalization, Mapping, and Enrichment → Indexing → Post-Ingestion Maintenance
43. Post-Ingestion Maintenance
Currency
• Currency is the process of the publisher/provider sending Summon new/updated metadata records, or record deletions for content that needs to be removed
• The frequency of updates is often at the discretion of the publisher/provider
Metadata Issues
• Address reported issues of metadata quality from Summon customers
• Most issues involve incorrect metadata, or slight variations in the metadata that may impact OpenURL linking or the record deduplication process (Match & Merge)
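As an illustration of the currency step (the record shapes and the idea of an explicit delete list are assumptions, not the actual Summon feed format), applying one provider update to an indexed set of records keyed by their unique ID:

def apply_currency_update(index, new_or_updated, deletions):
    """Merge a provider's update into the index: upsert changed records, drop withdrawn ones."""
    for record in new_or_updated:
        index[record["id"]] = record      # add new records and overwrite updated ones
    for record_id in deletions:
        index.pop(record_id, None)        # remove content that should no longer be discoverable
    return index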
44. Thank you – Any Questions?
Heather Sherman
Heather.sherman@dawsonbooks.co.uk
Benjamin Johnson
Benjamin.Johnson@proquest.com
Dave Hovenden
Dave.Hovenden@proquest.com
Editor's Notes
Hello, I am Ben Johnson, the Lead Metadata Librarian for the Provider Data side of our Knowledgebase. I manage the team responsible for getting and maintaining the Provider Data in our knowledgebase (KB). The KB drives all of our ERM and access products such as Intota and the 360 Suite of products (Link, Resource Manager, etc.), and provides rights data to our discovery layer, Summon. It’s what makes it so you don’t always search everything in Summon, only what you have access to. I am also the co-chair for the KBART Standing Committee along with Magaly Bascones from JISC/KB+.
If you’re a World of Warcraft player and can tell me what’s going on here, I might have a job for you.
As you’ll see, our process is, at a high level, very much similar to Heather’s at Dawsonera. We acquire the data from the content providers, ingest it into our systems (which is a multi-step process as we’ll see in a minute), and make corrections to the data so that it is as accurate as we are able to get it and so that it works well with our downstream products.
And stuff happens, blah blah blah profit.
We work with many different kinds of content providers, from the traditional publishers such as Taylor & Francis, Oxford, Cambridge, Springer- and content aggregators such as EBSCO, Dawsonera, Gale- to self-publishing/hosted content at universities and libraries, to library consortia who provide us data about their members’ licensed deals, such as Jisc through their KB+ platform (who I’m sure you’re familiar with as they provide license data for UK institutions), and BIBSAM, who provide us with similar packaged content through KB+ for Swedish institutions.
Content acquisition is the trickiest part of our entire process. We can’t serve customers with good data if we don’t have any data to begin with. Unlike Dawsonera, for the KB, we do not pay anyone to give us metadata. Unlike both Dawsonera and Summon, we do not acquire full text data, so there are generally no contracts or other agreements that are signed to use that data. Someone (either someone on my team, someone on our Provider Relations team, or a customer) needs to convince the content provider to provide us with metadata so that our mutual customers can manage, access and discover the provider’s data using our products. This increases the usage of the provider’s products in turn, which is how we are able to get them on board.
Here's where, as the co-chair, I feel like I need to make a plug for KBART (KnowledgeBases And Related Tools), a joint NISO/UKSG working group of librarians, vendors and content providers who have come up with a set of recommendations for the transmission of title-level metadata between content providers, vendors, and libraries. The chief use is for populating knowledgebases with provider content, but an overlooked secondary use is as a human-readable format for librarians. Even though it's called a "title list", the focus of the data is not cataloguing or thorough description of the titles; the titles are instead attributes that describe the database/package of content being sold or accessed through the content provider, including what those titles are, where they can be accessed and browsed, and, in the case of serials, the date ranges that are available through that package of content. If you are interested in the details of the recommended practice, which also gives a good overview of what kinds of data are tracked and used by a knowledgebase, the methods of transmission of that data, and related issues around title-list metadata and its quality (or absence), I recommend that you take a look at the NISO KBART site. We encourage all of the providers we work with to develop KBART title lists, as it is the quasi-standard for this kind of data. We'll take data that isn't a KBART list (or that doesn't meet all of the requirements), but KBART really is preferred.
[incorporate something about “it’s OK if it’s not KBART, but….”]
So we get the data from the provider via their website, an FTP site, or some other method, using a content acquisition tool to automate the process. In the KB, we represent the provider's packages as closely as possible (as recommended by KBART). In the case of consortia such as JISC and Bibsam we provide a mapping so that member libraries can discern which package in the KB they should have access to, provided it's not consortium-specific content. We also often offer A-Z lists for libraries for a la carte/cherry-picked titles, since that is also a popular method for purchasing content, for example through Dawson. Then we need to prepare the data for ingestion into our system. We take the content (most often coming in CSV or tab-delimited text files, occasionally XML or HTML content scraped from web sites) and map it to our schema. We transform the data to be compatible with our systems, for example standardizing date formats and converting HTML characters to Unicode.
We have:
Flattened the structure and greatly reduced the number of fields from the original XML. This avoids ingesting fields that we don't need.
Mapped the XML fields to our schema. We are only using the print (hardback) ISBN here, as that identifier works much better to match the provider record in the title list to an authoritative MARC record as part of our title reconciliation process (which we also call normalization).
Normalized the data – removed initial articles (basically anything that would be considered a nonfiling character in MARC), which also aids that MARC record matching process. We've also mapped the content to our internal database code – that's what you see on the right there, 20A, which is the Dawsonera database in the KB.
After the data is mapped and transformed, it is loaded and compared with the current data set that we have from the provider. The system compiles a list of changes (or a delta) and these changes are reviewed, sometimes automatically, sometimes by a human, to ensure the integrity of the content. Once those changes are vetted and the content looks good, the data load is accepted and written to our database/the KB.
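A bare-bones sketch of that comparison (assuming both the current and incoming title lists are keyed on a title identifier); the real review step would then inspect the delta before anything is written to the KB:

def compute_delta(current, incoming):
    """Compare the provider's new title list against what the KB already holds."""
    added   = [tid for tid in incoming if tid not in current]
    removed = [tid for tid in current if tid not in incoming]
    changed = [tid for tid in incoming if tid in current and incoming[tid] != current[tid]]
    return {"added": added, "removed": removed, "changed": changed}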
We can't just load a set of content once and be done, though. The data is very fluid and needs to be refreshed continually in order to be of any real use. Some content needs to be updated more frequently, such as Dawsonera and other large aggregated ebook providers; others less frequently. Our typical update or currency cycle is monthly, as this is how often most providers give us new data to load. Providers such as Jisc and Bibsam go so far as to notify us when there are changes; however, we usually just pull automatically at the appropriate update interval.
[Currency]
Once the data is in the KB, we keep our eyes on it and make sure it behaves. If customers let us know that something is wrong with that set of content, then we will take steps to make those corrections to ensure that the data is as accurate as possible and works well with our downstream products.
Once the data is in the KB, downstream products – Intota, the 360 suite, Summon – use it in various ways: to link directly to the content, to provide a base of information for a library to manage their content, or, based on a library's subscriptions in ERM (populated by the KB), to drive discovery and access of that content.
With that I’ll turn the mic over to my colleague, Dave Hovenden.