With library collections now predominantly electronic, there is more and more reliance on ‘knowledgebases’, those databases of metadata about e-resources that are provided by suppliers of e-resource management software (ERM), as well as by community organisations such as Jisc. This panel, made up of an e-book supplier, a metadata librarian and a discovery service repository manager, will provide the audience with a view of what it takes to actually get metadata from the supplier of the e-resource through the ingest and editorial processes of the knowledgebase provider and into the discovery service.
How Metadata Gets Into a Knowledgebase
1. In and out: how does that metadata get into a knowledgebase anyhow?
Heather Sherman
Head of Library Programme Management – Dawson Books
2. Creation process
• Sign contract with publisher
• Acquire content and basic metadata
• Correct metadata errors
• Enhance basic metadata
• Create ProQuest xml feed
• Create TOC data
3. Sign contract with publisher
• The process starts with a publisher agreeing to host their titles on dawsonera.
• Publishers are asked to send Dawson the ebook content, jacket image and associated metadata.
• Some send this in xml. Others complete a spreadsheet.
4. Publisher sends files of metadata
Publishers supply key pieces of metadata:
• eISBN
• Title
• Subtitle
• Author(s)
• Price
• Currency
• PDF file name
• Jacket image
• Publisher
• Imprint
• Publication date
• Edition
• Country of publication
• Usage model
6. Publisher sends files of metadata
However…
• Not all publishers supply the key data, so we have to go and find it.
• Some supply incorrect data, so we have to fix that.
• Dawson's automated import process checks that the key data is present and correct, and reports any errors.
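The slides do not show what those import checks look like. Purely as an illustration, here is a minimal Python sketch (with hypothetical field names, not Dawson's actual import code) of the kind of presence and ISBN check-digit validation such an automated import might run:

# Illustrative only: field names and rules are assumptions.
REQUIRED_FIELDS = ["eisbn", "title", "authors", "publisher", "publication_date", "pdf_file_name"]

def isbn13_check_digit_ok(isbn):
    """True if the 13 digits satisfy the ISBN-13 checksum (alternating weights 1 and 3)."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0

def report_errors(record):
    """Collect problems for one publisher-supplied record rather than silently loading it."""
    errors = ["missing field: " + field for field in REQUIRED_FIELDS if not record.get(field)]
    if record.get("eisbn") and not isbn13_check_digit_ok(record["eisbn"]):
        errors.append("eisbn fails the ISBN-13 check digit")
    return errors

print(report_errors({"eisbn": "9780191015021", "title": "The Global Revolution"}))
# -> ['missing field: authors', 'missing field: publisher', ...]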
8. Table of contents data created
• PDF files are sent to an agency that creates Table of Contents (TOC) data.
• For ePub files, the TOC is extracted directly from the file.
• TOC data is imported into the Dawson system and matched up with the PDFs and metadata.
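How exactly the returned TOC files are matched up is not described. One plausible, purely hypothetical approach, sketched below, is to key both the TOC file and the metadata record on the eISBN, the same convention the jacket-image file name in the later XML example appears to follow:

from pathlib import Path

def match_toc_files(toc_dir, records_by_eisbn):
    """Attach each TOC file to its metadata record, assuming files are named <eisbn13>.xml."""
    unmatched = []
    for toc_file in Path(toc_dir).glob("*.xml"):
        eisbn = toc_file.stem  # e.g. "9780191015021"
        record = records_by_eisbn.get(eisbn)
        if record is None:
            unmatched.append(toc_file.name)  # report rather than drop silently
        else:
            record["toc_path"] = str(toc_file)
    return {"unmatched_toc_files": unmatched}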
9. TOC xml
10. Metadata enhanced
• Publisher metadata and TOC data are matched to existing print records in the Dawson title database.
• A hybrid record is created, incorporating data from the publisher and Dawson.
• This produces a record containing as much information as Dawson holds about the title.
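As a rough sketch only (the real matching runs against Dawson's title database), a hybrid record could be built by starting from the matched print record and letting non-empty publisher-supplied e-book fields fill in or override where present:

def build_hybrid_record(print_record, publisher_record, toc=None):
    """Merge the matched print record with the publisher's e-book metadata (illustrative).

    The print record supplies descriptive fields Dawson already holds; non-empty
    publisher values (eISBN, price, usage model, ...) are layered on top.
    """
    hybrid = dict(print_record)
    hybrid.update({k: v for k, v in publisher_record.items() if v not in (None, "")})
    if toc:
        hybrid["toc"] = toc
    return hybrid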
11. Dawson ebook MARC record
=LDR 01354nam 2200349 4500
=001 DAW28874972
=007 cr
=008 140327s2014enkfs001|0|eng|d
=020 $a0191015024 (e-book)
=020 $a9780191015021 (e-book)
=040 $aStDuBDS$cStDuBDS$erda$dDAWSON
=041 1$aeng$hita
=082 04$a320.53209$223
=100 1$aPons, Silvio,$eauthor.
=245 14$aThe global revolution$h[electronic resource] : $ba history of international communism, 1917-1991 / $cSilvio Pons ;
translated by Allan Cameron.
=264 1$aOxford :$bOxford University Press,$c2014.
=300 $axx, 365 pages
=336 $atext$2rdacontent
=337 $acomputer$2rdamedia
=338 $aonline resource$2rdacarrier
=490 1$aOxford studies in modern European history
=500 $aTranslated from the Italian.
=504 $aIncludes bibliographical references and index.
=530 $aAlso available in printed form.
=533 $aElectronic reproduction.$cDawson Books.$nMode of access: World Wide Web.
=650 0$aCommunism$xHistory.
=650 0$aCommunism.
=655 7$aElectronic books.$2lcsh
=700 1$aCameron, Allan,$d1952-$etranslator.
=776 0$cHardback$z9780199657629
=830 0$aOxford studies in modern European history.
12. ProQuest feed created
• The hybrid record is extracted and turned into an xml record.
• Dawson sends daily files of new titles and updated data to ProQuest.
• A weekly file of data for all titles is also sent.
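The next slide shows the XML itself. As a hedged illustration of the extraction step, here is a sketch that serialises a hybrid record into that <document> shape using Python's standard library; the element names are copied from the example, everything else is assumed:

import xml.etree.ElementTree as ET

def hybrid_to_proquest_xml(record):
    """Serialise one hybrid record into a dawsonera-style <document> element (illustrative)."""
    doc = ET.Element("document", {"jacket": record["eisbn13"] + ".jpg",
                                  "lang": record.get("lang", "eng")})
    eisbn = ET.SubElement(doc, "eisbn")
    ET.SubElement(eisbn, "eisbn13").text = record["eisbn13"]
    ET.SubElement(eisbn, "eisbn10").text = record.get("eisbn10", "")
    titles = ET.SubElement(doc, "title-group")
    ET.SubElement(titles, "title").text = record["title"]
    if record.get("subtitle"):
        ET.SubElement(titles, "subtitle").text = record["subtitle"]
    return ET.tostring(doc, encoding="unicode")

# The daily feed would write out only records created or changed since the last run;
# the weekly feed would simply iterate over every title.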
13. xml data sent to ProQuest
<document initial-page="4" jacket="9780191015021.jpg" lang="eng">
<eisbn>
<eisbn13>9780191015021</eisbn13>
<eisbn10>0191015024</eisbn10>
</eisbn>
<isbn-group>
<isbn10 type="hb">0199657629</isbn10>
<isbn13 type="hb">9780199657629</isbn13>
</isbn-group>
<title-group>
<title>The Global Revolution: A History of International Communism 1917-1991</title>
<subtitle>A History of International Communism 1917-1991</subtitle>
</title-group>
<author-group>
<author>
<person-name>Silvio Pons ; Translated By Allan Cameron.</person-name>
</author>
</author-group>
14. IN AND OUT: HOW DOES THAT METADATA GET INTO A KNOWLEDGEBASE ANYHOW?
Ben Johnson
Lead Metadata Librarian, KB Provider Data
Benjamin.Johnson@proquest.com
Acquisition and Ingestion of Provider Data into a Knowledgebase (KB)
17. Introduction
Acquire
• Get the data
• Verify compatibility
• Map the data
Ingest
• Transform the data
• Load
• Review
• Accept/Reject
Correct
• Customer inquiries
• Content integrity
• Product interoperability
… Profit!
18. Providers we partner with
• Publishers
• Content aggregators (PQ, Gale)
• University and library local content
• Library consortia (JISC, BIBSAM)
20. KBART
• Joint NISO/UKSG Group
• Librarians, Vendors, Providers
• Transmission of metadata to vendors
• Human and machine readable data
• http://www.niso.org/workrooms/kbart
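The recommended practice defines the full set of tab-delimited title-list fields. As a small sketch (not from the slides), writing a few KBART-style columns for the Dawsonera title used elsewhere in this presentation; only a subset of fields is shown and the URL is a placeholder:

import csv, sys

# A subset of KBART title-list fields; see the recommended practice for the full list.
FIELDS = ["publication_title", "print_identifier", "online_identifier",
          "title_url", "first_author", "publisher_name", "publication_type"]

row = {
    "publication_title": "The Global Revolution",
    "print_identifier": "9780199657629",
    "online_identifier": "9780191015021",
    "title_url": "https://www.dawsonera.com/",  # placeholder, not a real title-level link
    "first_author": "Pons",
    "publisher_name": "Oxford University Press",
    "publication_type": "monograph",
}

writer = csv.DictWriter(sys.stdout, fieldnames=FIELDS, delimiter="\t")
writer.writeheader()
writer.writerow(row)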
21. Ingestion – mapping and transformation
Acquire the data
• FTP, HTML
• CSV/Text, Excel, XML, HTML
Create packages
• Data for existing content is mapped to KB packages (new T&F package, JISC/BIBSAM new license)
Transform the content
• Map the content to our schema
• Normalize the data (dates, diacritics)
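As a hedged sketch of that transform step (the field formats and the list of date patterns are assumptions, not the actual KB tooling), normalising dates to ISO 8601 and composing diacritics consistently before the data is compared with what is already in the KB:

import unicodedata
from datetime import datetime

# The formats actually seen vary by provider; these are examples only.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y%m%d", "%B %Y"]

def normalize_date(value):
    """Return an ISO 8601 date if the value matches a known provider format, else the original."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value

def normalize_text(value):
    """Compose diacritics consistently (Unicode NFC) and collapse stray whitespace."""
    return " ".join(unicodedata.normalize("NFC", value).split())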
28. IN AND OUT: HOW DOES THAT METADATA GET INTO A KNOWLEDGEBASE ANYHOW?
Dave Hovenden – Content Operations Manager, Summon
ProQuest
UKSG Conference – 30 March – 1 April, 2015
30. The Content Ingestion Process at Summon for Commercial Content
Identify New Content to Add into Summon
31. Identifying New Commercial Content to Add into Summon
• Product Management, Sales, and our Global Content Alliance work together to identify new content to add into Summon
• New content requests from Summon customers are also considered
• Publishers and content providers may also request to have their content added into Summon
32. The Content Ingestion Process at Summon for Commercial Content
Identify New Content to Add into Summon → Engage with Publisher/Provider → Pre-Agreement Content Sample Analysis
33. Pre-Agreement Content Sample Analysis
• The sample analysis is used to help determine the quality and extent of the metadata and the metadata schema
• We also try to determine things such as linking methods, how rights are assigned to the content, and what databases we may need in our knowledgebase (if they don't already exist)
• Summon often indexes content at the article level or chapter level, as that is usually the level of granularity at which the content is supplied
34. What Metadata Do We Look For During Sample Analysis?
Title Metadata
• Article titles, Book titles, Publication titles, Subtitles, etc.
Identifier Metadata
• Unique IDs for specific articles, chapters, etc.
• Publication-level unique identifiers such as ISSN or ISBN
• Additional identifiers such as OCLC Number, LCCN, Dewey, DOI, etc.
Publication Information Metadata
• Publisher, Author(s), Corporate Authors, Volume Numbers, Issue Numbers, Start Page, Publication Date, Publication Series, etc.
Additional Metadata
• Subject Headings, Keywords, Language
35. Dawsonera Book Example – The Global Revolution: A History of International Communism 1917-1991 (ISBN-13 – 9780199657629)
<document initial-page="4" jacket="9780191015021.jpg" lang="eng">
<eisbn>
<eisbn13>9780191015021</eisbn13>
<eisbn10>0191015024</eisbn10>
</eisbn>
<territory-group/>
<parent-isbn/>
<isbn-group>
<isbn10 type="hb">0199657629</isbn10>
<isbn13 type="hb">9780199657629</isbn13>
</isbn-group>
<title-group>
<title>The Global Revolution: A History of International Communism 1917-1991</title>
<subtitle>A History of International Communism 1917-1991</subtitle>
</title-group>
<author-group>
<author>
<person-name>Silvio Pons ; Translated By Allan Cameron.</person-name>
</author>
</author-group>
<endnote-authors>
<endnote-author>Pons, Silvio,</endnote-author>
<endnote-author>Cameron, Allan,</endnote-author>
</endnote-authors>
37. Summon and the Knowledgebase
• Summon relies upon the knowledgebase to help facilitate rights access to the content
• Rights access is assigned by tracking a particular title by ISSN or ISBN in the knowledgebase, or by Database ID
• The knowledgebase also helps Summon indicate when content has full-text availability
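A minimal sketch of that idea (the data structures are invented for illustration, not Summon's internals): full-text rights for an indexed item are switched on only if the title's ISSN/ISBN, or the database it belongs to, appears in the library's knowledgebase holdings:

def has_fulltext_rights(item, holdings):
    """True if the library's KB profile grants access to this indexed item.

    `holdings` is assumed to hold the identifiers tracked in the KB:
    a set of subscribed database IDs and a set of ISSNs/ISBNs.
    """
    if item.get("database_id") in holdings.get("database_ids", set()):
        return True
    identifiers = {item.get("issn"), item.get("isbn")} - {None}
    return bool(identifiers & holdings.get("identifiers", set()))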
38. The Content Ingestion Process at Summon for Commercial Content
Identify New Content to Add into Summon → Engage with Publisher/Provider → Pre-Agreement Content Sample Analysis → Formalize and Sign Data Sharing Agreement → Data is Delivered in Full from Publisher/Provider → Data Normalization, Mapping, and Enrichment
39. Data Normalization, Mapping, and Enrichment Work
Data Normalization
• Very basic high-level clean-up of the data to standardize it
• Examples include: removing leading/trailing white spaces in Title and Subtitle fields; cleaning up diacritics and other encoding issues
Mapping
• Map the metadata fields in the records to the Summon schema
• This allows the metadata to appear in the UI and/or be made searchable within Summon
Enrichment
• Enriching the content by adding additional metadata when applicable
• Examples: scholarly/peer-reviewed flags from Ulrich's; citation counts from Scopus; book cover images from Syndetics
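A rough sketch of what those three steps could look like in code; the target field names are invented for illustration (not the real Summon schema) and the enrichment lookups are stubs standing in for Ulrich's and Syndetics:

FIELD_MAP = {  # provider field -> hypothetical discovery-schema field
    "title": "Title",
    "subtitle": "Subtitle",
    "eisbn13": "ISBN",
    "person-name": "Author",
}

def normalize(record):
    """Trim leading/trailing whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def map_to_schema(record):
    """Carry across only the fields the discovery schema knows about."""
    return {target: record[source] for source, target in FIELD_MAP.items() if source in record}

def enrich(doc, peer_review_lookup, cover_lookup):
    """Add externally sourced metadata when available (stand-ins for Ulrich's/Syndetics)."""
    doc["IsPeerReviewed"] = peer_review_lookup(doc.get("ISBN"))
    cover = cover_lookup(doc.get("ISBN"))
    if cover:
        doc["CoverImageURL"] = cover
    return doc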
40. The Content Ingestion Process at Summon for Commercial Content
Identify New Content to Add into Summon → Engage with Publisher/Provider → Pre-Agreement Content Sample Analysis → Formalize and Sign Data Sharing Agreement → Data is Delivered in Full from Publisher/Provider → Data Normalization, Mapping, and Enrichment → Indexing
41. The Title as it Appears in Summon Once Indexed
42. The Content Ingestion Process at Summon for Commercial Content
Identify New Content to Add into Summon → Engage with Publisher/Provider → Pre-Agreement Content Sample Analysis → Formalize and Sign Data Sharing Agreement → Data is Delivered in Full from Publisher/Provider → Data Normalization, Mapping, and Enrichment → Indexing → Post-Ingestion Maintenance
43. Post-Ingestion Maintenance
Currency
• Currency is the process of the publisher/provider sending Summon new/updated metadata records, or record deletions for content that needs to be removed
• The frequency of updates is often at the discretion of the publisher/provider
Metadata Issues
• Address reported issues of metadata quality from Summon customers
• Most issues involve incorrect metadata, or slight variations in the metadata that may impact OpenURL linking or the record deduplication process (Match & Merge)
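As an illustration of the currency step (the record shapes and the idea of an explicit delete list are assumptions, not the actual Summon feed format), applying one provider update to an indexed set of records keyed by their unique ID:

def apply_currency_update(index, new_or_updated, deletions):
    """Merge a provider's update into the index: upsert changed records, drop withdrawn ones."""
    for record in new_or_updated:
        index[record["id"]] = record      # add new records and overwrite updated ones
    for record_id in deletions:
        index.pop(record_id, None)        # remove content that should no longer be discoverable
    return index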
44. Thank you – Any Questions?
Heather Sherman
Heather.sherman@dawsonbooks.co.uk
Benjamin Johnson
Benjamin.Johnson@proquest.com
Dave Hovenden
Dave.Hovenden@proquest.com
Editor's Notes
Hello, I am Ben Johnson, the Lead Metadata Librarian for the Provider Data side of our Knowledgebase. I manage the team responsible for getting and maintaining the Provider Data in our knowledgebase (KB). The KB drives all of our ERM and access products such as Intota and the 360 Suite of products (Link, Resource Manager, etc.), and provides rights data to our discovery layer, Summon. It’s what makes it so you don’t always search everything in Summon, only what you have access to. I am also the co-chair for the KBART Standing Committee along with Magaly Bascones from JISC/KB+.
If you’re a World of Warcraft player and can tell me what’s going on here, I might have a job for you.
As you’ll see, our process is, at a high level, very much similar to Heather’s at Dawsonera. We acquire the data from the content providers, ingest it into our systems (which is a multi-step process as we’ll see in a minute), and make corrections to the data so that it is as accurate as we are able to get it and so that it works well with our downstream products.
And stuff happens, blah blah blah profit.
We work with many different kinds of content providers, from the traditional publishers such as Taylor & Francis, Oxford, Cambridge, Springer- and content aggregators such as EBSCO, Dawsonera, Gale- to self-publishing/hosted content at universities and libraries, to library consortia who provide us data about their members’ licensed deals, such as Jisc through their KB+ platform (who I’m sure you’re familiar with as they provide license data for UK institutions), and BIBSAM, who provide us with similar packaged content through KB+ for Swedish institutions.
Content acquisition is the trickiest part of our entire process. We can’t serve customers with good data if we don’t have any data to begin with. Unlike Dawsonera, for the KB, we do not pay anyone to give us metadata. Unlike both Dawsonera and Summon, we do not acquire full text data, so there are generally no contracts or other agreements that are signed to use that data. Someone (either someone on my team, someone on our Provider Relations team, or a customer) needs to convince the content provider to provide us with metadata so that our mutual customers can manage, access and discover the provider’s data using our products. This increases the usage of the provider’s products in turn, which is how we are able to get them on board.
Here's where, as the co-chair, I feel like I need to make a plug for KBART (KnowledgeBases And Related Tools), a joint NISO/UKSG working group of librarians, vendors and content providers who have come up with a set of recommendations for the transmission of title-level metadata between content providers, vendors, and libraries. The chief use is for populating knowledgebases with provider content, but an overlooked secondary use is as a human-readable format for librarians. Even though it's called a "title list", the focus of the data is not cataloguing or thorough description of the titles; the titles are instead attributes that describe the database/package of content being sold or accessed through the content provider, including what those titles are, where they can be accessed and browsed, and, in the case of serials, the date ranges that are available through that package of content. If you are interested in the details of the recommended practice, which also gives a good overview of what kinds of data are tracked and used by a knowledgebase, the methods of transmission of that data, and related issues around title-list metadata and its quality (or absence), I recommend that you take a look at the NISO KBART site. We encourage all of the providers we work with to develop KBART title lists, as it is the quasi-standard for this kind of data. We'll take data that isn't a KBART list (or that doesn't meet all of the requirements), but KBART really is preferred.
[incorporate something about “it’s OK if it’s not KBART, but….”]
So we get the data from the provider via their website, an FTP site, or some other method, using a content acquisition tool to automate the process. In the KB, we represent the provider's packages as closely as possible (as recommended by KBART). In the case of consortia such as JISC and Bibsam we provide a mapping so that member libraries can discern which package in the KB they should have access to, provided it's not consortium-specific content. We also often offer A-Z lists for libraries for a la carte/cherry-picked titles, since that is also a popular method for purchasing content, for example through Dawson. Then we need to prepare the data for ingestion into our system. We take the content (most often coming in CSV or tab-delimited text files, occasionally XML or HTML content scraped from web sites) and map it to our schema. We transform the data to be compatible with our systems, for example standardizing date formats and converting HTML characters to Unicode.
We have:
Flattened the structure and greatly reduced the number of fields from the original XML. This avoids ingesting fields that we don't need.
Mapped the XML fields to our schema. We are only using the print (hardback) ISBN here, as that identifier works much better to match the provider record in the title list to an authoritative MARC record as part of our title reconciliation process (which we also call normalization).
Normalized the data – removed initial articles (basically anything that would be considered a nonfiling character in MARC), which also aids that MARC record matching process. We've also mapped the content to our internal database code – that's what you see on the right there, 20A, which is the Dawsonera database in the KB.
After the data is mapped and transformed, it is loaded and compared with the current data set that we have from the provider. The system compiles a list of changes (or a delta) and these changes are reviewed, sometimes automatically, sometimes by a human, to ensure the integrity of the content. Once those changes are vetted and the content looks good, the data load is accepted and written to our database/the KB.
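A bare-bones sketch of that comparison (assuming both the current and incoming title lists are keyed on a title identifier); the real review step would then inspect the delta before anything is written to the KB:

def compute_delta(current, incoming):
    """Compare the provider's new title list against what the KB already holds."""
    added   = [tid for tid in incoming if tid not in current]
    removed = [tid for tid in current if tid not in incoming]
    changed = [tid for tid in incoming if tid in current and incoming[tid] != current[tid]]
    return {"added": added, "removed": removed, "changed": changed}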
We can't just load a set of content once and be done, though. The data is very fluid and needs to be refreshed continually in order to be of any real use. Some content needs to be updated more frequently, such as Dawsonera and other large aggregated ebook providers; others less frequently. Our typical update or currency cycle is monthly, as this is how often most providers give us new data to load. Providers such as Jisc and Bibsam go so far as to notify us when there are changes; however, we usually just pull automatically at the appropriate update interval.
[Currency]
Once the data is in the KB, we keep our eyes on it and make sure it behaves. If customers let us know that something is wrong with that set of content, then we will take steps to make those corrections to ensure that the data is as accurate as possible and works well with our downstream products.
Once the data is in the KB, downstream products – Intota, the 360 suite, Summon – use it in various ways: to link directly to the content, to provide a base of information for a library to manage their content, or, based on a library's subscriptions in ERM (populated by the KB), to drive discovery and access of that content.
With that I’ll turn the mic over to my colleague, Dave Hovenden.