This document discusses ISNI's CBS software which powers the ISNI database. It covers how the software handles searching, updating, loading data, and providing utilities. Key features include matching algorithms, linked data representations, batch loading of data from various sources, and tools for searching, resolving matches, and editing records online or through APIs.
ISNI behind-the-scenes operations and data management
1. Cross-domain: bridging domains
Libraries
Text Rights
Trade Sources
Music Rights
Encyclopaedias
Researchers & Professionals
Granting organisations
Professional Societies
Article databases
Theses databases
Archives and Museums
Harvard University Library 2014-11-18
2. ISNI Behind the scenes
• ISNI’s CBS software
• Performance
• Searching
• SRU enquiry API
• Indexes
• Linked data
• Updating
• Batch load
• Matching
• VIAF Update
• Web Cat
• AtomPub
• WinIBW
• Utilities
• Hunting anomalies
• Reports and statistics
• QT / End user interface
ISNI at Harvard
18 November 2014
Janifer Gatenby
Boaz Nadav Manes
OCLC EMEA
3. CBS
Centraal Bibliotheek Systeem
ISNI’s CBS software
Searching
Update
Utilities
4. Tailoring and building
• Loading
• Export
• Matching & merging
• Web OPAC
• Web cataloguing
• Cataloguing client
• SRU
• SRU update (Atom Pub)
• Hosting infrastructure
• Data definition (based on VIAF MARC)
• Input formats (tab and XML)
• Indexes
• Matching (based on VIAF)
• Public / Private data mix
• Statistics
• QT / end user interoperability
• Reports
• Fix jobs (pseudonyms, reports ++)
5. Enquiries Dutch Union Catalogue
ISNI 14,000 per day cf GGC 50 per second
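To put the two rates on the same footing, the comparison can be worked through as a short calculation. This is only a sketch: it assumes the 14,000 daily ISNI enquiries are spread evenly over 24 hours, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

isni_enquiries_per_day = 14_000   # figure from the slide
ggc_enquiries_per_second = 50     # figure from the slide (Dutch union catalogue, GGC)

# Assumption: ISNI traffic is spread evenly across the day.
isni_enquiries_per_second = isni_enquiries_per_day / SECONDS_PER_DAY
ratio = ggc_enquiries_per_second / isni_enquiries_per_second

print(f"ISNI: {isni_enquiries_per_second:.3f} enquiries per second")
print(f"GGC handles roughly {ratio:.0f} times that rate")
```

On these assumptions ISNI averages well under one enquiry per second, a few hundred times less than the load the same software already carries for the GGC.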
6. ISNI Data Definitions
http://www.isni.org/filedepot_download/140/390
Name Use Organisation
legalName
acronym
nickname
assignedName
transliteratedName
disused name
commonForm (default)
Name Use Person
Public
public and private
private
fictional character
Unknown
7. Searching indexes
Examples
Your code and update date > December 2013
Cn: ams & upd: > 201312
Your code and another’s code
Cn: jnam & cn: proq
Name keyword, not your code
Nw: trobe not cn: auvlu
Almost anything can be indexed
Also available by SRU API
See document ISNI search guidelines.doc
http://www.isni.org/content/documents-related-database-enquiry
8. Browse
9. Search by SRU API
See Document:
ISNI SRU search API guidelines.doc
Example search by name keyword (pica.nw):
http://isni.oclc.nl/sru/?query=pica.nw+%3D+%22maloy%2Brebecca%22&operation=searchRetrieve&recordSchema=isni-b
This search is for any records containing both “Rebecca” and “Maloy” in the name.
The response is in the XML enquiry response schema: ISNI enquiry response v2.xsd
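The example URL can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the endpoint, index name and record schema are taken from the slide, while the function name `build_name_keyword_url` is an illustrative assumption. It only builds the URL and does not contact the service.

```python
from urllib.parse import urlencode

SRU_BASE_URL = "http://isni.oclc.nl/sru/"  # endpoint as shown on the slide


def build_name_keyword_url(*name_words: str, record_schema: str = "isni-b") -> str:
    """Build a searchRetrieve URL for a pica.nw (name keyword) query."""
    # The words are joined with '+' inside a quoted phrase, as in the slide's example.
    query = 'pica.nw = "{}"'.format("+".join(name_words))
    params = {
        "query": query,
        "operation": "searchRetrieve",
        "recordSchema": record_schema,
    }
    # urlencode percent-encodes '=', '"' and '+' and turns spaces into '+'.
    return SRU_BASE_URL + "?" + urlencode(params)


url = build_name_keyword_url("maloy", "rebecca")
print(url)
```

Calling the function with "maloy" and "rebecca" reproduces the URL shown on the slide.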
10. SRU API Enquiry Response
11. SRU API Enquiry Response
12. Member View – see all data except private data
13. Member view – additional data displayed
(if not private)
• Nationality
• Gender
• Keyword or key phrase
• Dewey classification
• Publisher
• Dates active
• Associated countries
• Provisional records
• Including links to possible matches, if applicable
14. Private Data
• Dates
• Personal Affiliations
• Titles of works
Rights management societies may not reveal data because their legal contracts do not allow it, though they make the data available for matching.
Trade organisations may choose not to reveal the titles they supply, to retain commercial advantage.
Private data hides behind publicly available data. 14 of 30 sources are fully public and more than 90% of records contain public sources.
15. Fully Public sources in Green
GENERAL SOURCES
Bowker Books in Print BOWKER
ISNI (Generated, adopted, made by QT) ISNI
The European Library (48 national libraries) TEL
VIAF (33 libraries) VIAF
RIGHTS MANAGEMENT
Access Copyright, Canada ACCE
Authors’ Guild AGLD
Authors’ Licensing and Collecting Society, UK ALCS
Centrum Dienstverlening Auteurs- en aanverwante Rechten, Netherlands CEDA
Centro Español de Derechos Reprográficos CEDR
Irish Copyright Licensing Agency ICLA
Prolitteris, Switzerland PROL
VG WORT, Germany VGWO
MUSIC
American Musicological Society AMS
British Library Sound Archive BLSA
International Performers’ Database Association IPDA
MusicBrainz MUBZ
RESEARCHERS AND PROFESSIONALS
American Musicological Society AMS
British Library Theses BRTH
Digital Author identifier, Netherlands DAI
Jisc Names Project, UK JNAM
La Trobe University AU:VLU
Modern Languages Association MLA
OCLC Theses OCLCT
ORCID and DataCite Interoperability Network ODIN
AuthorClaim and RePec OPENL
Proquest Theses PROQ
Scholar Universe, Proquest SCHU
Electronic tables of content ZETO
ORGANISATIONS
Boekenbank, Belgium BOEK
Bowker Publishers BOWP
Publishers Licensing Society, UK PLS
Ringgold RING
16. ISNI as linked data
18. Documents relating to enquiry
ISNI search guidelines
ISNI SRU search API guidelines
ISNI SRU search API guidelines - public.doc
ISNI XML enquiry response schema
ISNI Access Comparison Public and Member
Getting started with PSI queries
19. Scalable Quality Ecosystem
ISNI Database
Harvested, Batch loaded; Online contributions
Algorithms
Notifications
Data fixing
Sampling
Data Policy
Enrichment
Correction
Curation
Crowd sourcing
Data contributors
20. Assigned 8.69 million
Provisional: Possible 700,815
Provisional: Unassigned 9,287,278
Assigned ISNIs November 2014
VIAF + non VIAF sources 4,870,099
3+ VIAF sources 428,988
2+ sources (not VIAF) 315,915
Unique name 2,735,449
Trusted single source (JISC, BOEK, RING) 342,231
Total 8,692,683
Authoritative,
Unique,
Trustful,
Persistent
8.24 million persons
446,258 organisations
+ % confidence
- % confidence
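The breakdown of assigned ISNIs can be cross-checked with a short calculation. This sketch copies the figures from the slide as they stand; the dictionary and variable names are illustrative.

```python
# Figures copied from the slide (assigned ISNIs, November 2014).
assigned_by_criterion = {
    "VIAF + non VIAF sources": 4_870_099,
    "3+ VIAF sources": 428_988,
    "2+ sources (not VIAF)": 315_915,
    "Unique name": 2_735_449,
    "Trusted single source (JISC, BOEK, RING)": 342_231,
}
stated_total = 8_692_683

computed_total = sum(assigned_by_criterion.values())

# Show each criterion with its share of the stated total.
for criterion, count in assigned_by_criterion.items():
    print(f"{criterion:<42} {count:>10,} {count / stated_total:6.1%}")
print(f"Sum of rows: {computed_total:,} (stated total: {stated_total:,})")
```

The rows as transcribed sum to 8,692,682, one less than the stated total, so one of the figures may have been captured a unit out. The rounded headline of 8.69 million is unaffected.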
21. Confidence
The two main problems for maintaining persistence are
• duplicates needing to be merged
• undifferentiated identities needing to be split
ISNI errs on the side of making duplicates rather than mixed identities
Thus the batch load process (usually) makes a provisional record
• where there is no match (for fear of making a duplicate assignment)
• where there is a low confidence match (for fear of making a mixed
identity or a duplicate assignment)
• where a matching record already has another local ID for the same
source, regardless of the strength of the match (for fear of making a
mixed identity)
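The three cases above can be read as a small decision rule. The following is a hedged sketch, not the CBS implementation: the function name, its parameters and the "merge" label are hypothetical, the 0.85 cut-off is borrowed from the matching slide as an assumption about what "low confidence" means, and the exceptions implied by "usually" (for example unique name assignment) are ignored.

```python
from typing import Optional

# Assumption: "low confidence" means not above the 0.85 match threshold
# quoted on the matching slide.
MATCH_THRESHOLD = 0.85


def batch_load_outcome(best_score: Optional[float],
                       match_has_other_local_id_from_source: bool = False) -> str:
    """Return 'provisional' or 'merge' for one incoming batch record.

    best_score is None when no candidate match was found.
    """
    if best_score is None:
        return "provisional"   # no match: avoid a duplicate assignment
    if best_score <= MATCH_THRESHOLD:
        return "provisional"   # low confidence: avoid a mixed identity or duplicate
    if match_has_other_local_id_from_source:
        return "provisional"   # same source already has another local ID on the match
    return "merge"


print(batch_load_outcome(None))        # no match
print(batch_load_outcome(0.70))        # low-confidence match
print(batch_load_outcome(0.95, True))  # strong match, but source ID conflict
print(batch_load_outcome(0.95))        # strong, unambiguous match
```

Every uncertain branch falls through to "provisional", which mirrors the stated preference for duplicates over mixed identities.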
23. ISNI Assignment: Batch loading
Unique name
Single source
24. ISNI Matching
Name
Title
Partial title
Rare title word
Date
Publisher
Personal affiliation
Organisation affiliation
ISBN, ISWC, ISAN, DOI +
Other name identifier e.g. IPI, VIAF, IPD
Instrument
Linked entities
Dewey classification
Scores are collected from each judge
Overall score computed; lowered where
• common surnames
• common titles
• if not much on which to match
Score > .85 = match
Score >.6 but <.85 = possible match
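The two thresholds translate into a three-way classification of the overall score. This is a minimal sketch: the function name is hypothetical, and because the slide does not say how scores of exactly 0.85 or 0.6 are treated, the boundary handling here is an assumption. How the judges' scores are combined and lowered is not specified, so it is not modelled.

```python
def classify_score(score: float) -> str:
    """Map an overall matching score to the outcome named on the slide."""
    if score > 0.85:
        return "match"
    if score > 0.6:
        # Assumption: a score of exactly 0.85 is treated as a possible match.
        return "possible match"
    # Assumption: anything at or below 0.6 is not reported as a match.
    return "no match"


for s in (0.92, 0.85, 0.7, 0.6, 0.3):
    print(s, classify_score(s))
```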
25. Building similarity vectors
Trying alternatives to traditional rules based matching
Working with Article First data
Will load high confidence data to ISNI with traditional matching
26. VIAF import
• VIAF issues a full file monthly
• Compare with previous full file
• Deletes / additions / Sources changed / contents changed
• Deletes
• delete only if not assigned
• Remove VIAF and mark for re-import
• If VIAF only source, change source to ISNI
• Cluster movement reports
27. Documentation: Data Submission
Documents relating to data submission
ISNI tab delimited format
ISNI tab delimited format organisations
ISNI data element values
ISNI XML request schema
ISNI XML request schema document
ISNI Atom Pub interactive request requirements
ISNI Data contributors usage guidelines
ISNI database source profiles RAG information
ISNI bulk load submission
Documents relating to data submission output
ISNI XML response schema
ISNI XML response schema document
ISNI XML notification schema
bulk load assigned ISNIs.xsd
bulk load ISNI not assigned.xsd
bulk load too many matches.xsd
ISNI Data contributors reports and notifications guidelines
28. Procedures for maximizing assignment
• Refinement of matching algorithms
• E.g. introduced rare title word;
• Now ignoring date of birth 1900
• Re-import program
• Rematch with new rules
• Rematch after new data added
• ISNI Quality Team: Data sampling
• assessing impact of single source
• Recommendations for program changes
• New criteria
• Assessing uncommon surname assignment
• Rules for online rich assignment
29. Online: Guarantee assignment – Personal Name
ISNIs will be automatically assigned where there are no possible matches in these cases:
There are matches with a database record with a different source
A personal name is unique and includes a surname and forename
The request includes an “isNot” statement
The metadata supplied is considered rich as per these cases:
• Full date of birth and death supplied
• Year of birth + 1 title or instrument + 1 related name (co-author or affiliated institution)
• 1 title or instrument + 1 external URL link of type encyclopaedia or home page (not social network page) + 1 related name (co-author or affiliated institution)
The request is resolving a possible match by including a PPN
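The "rich metadata" test for personal names can be expressed as a simple predicate. This is a sketch under stated assumptions: the `PersonRequest` class and its field names are hypothetical stand-ins and do not reflect the ISNI XML request schema.

```python
from dataclasses import dataclass


@dataclass
class PersonRequest:
    """Hypothetical summary of the metadata supplied in an online request."""
    has_full_birth_and_death_dates: bool = False
    has_year_of_birth: bool = False
    titles_or_instruments: int = 0
    related_names: int = 0     # co-authors or affiliated institutions
    qualifying_urls: int = 0   # encyclopaedia or home page links, not social network pages


def is_rich(req: PersonRequest) -> bool:
    """True if the request meets any one of the three richness cases on the slide."""
    has_work = req.titles_or_instruments >= 1
    has_related = req.related_names >= 1
    return (
        req.has_full_birth_and_death_dates
        or (req.has_year_of_birth and has_work and has_related)
        or (has_work and req.qualifying_urls >= 1 and has_related)
    )


print(is_rich(PersonRequest(has_full_birth_and_death_dates=True)))
print(is_rich(PersonRequest(has_year_of_birth=True, titles_or_instruments=1)))
```

The second request is not rich because a year of birth and a title still need a related name alongside them.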
30. Online: Guarantee assignment – Organisation Name
ISNIs will be automatically assigned where there are no possible matches in these cases:
There are matches with a database record with a different source
An organisation name is unique and does not consist only of abbreviations
The metadata supplied is considered rich as per these cases:
• Includes LOCODE &
• Organisation type &
• Organisation URL
The request is resolving a possible match by including a PPN
31. Maximizing assignment
Enter a request record online (Web page or via API)
Batch loaded records – passive method
• Quality Team manual fixes
• OCLC periodic re-match runs
• Matches from later batch loading & online activity
Source  May 2012  % assigned  Oct 2014  % assigned
ALCS    41,523    63.86%      49,157    76.66%
PROL    2,205     35.24%      4,143     66.18%
PROQ    65,122    12.89%      243,481   48.19%
Batch loaded records – active method
• Resolve possible matches found by the system
• Search the database for candidate records for merging
• Enrich a record with URLs to external sources such as author’s web pages, Wikipedia, IMDB, MusicBrainz, Discogs, etc.
Source  May 2012  % assigned  Oct 2014  % assigned
AUVLU   0         0%          1,716     48.28%
ICLA    0         0%          2,208     97.61%
32. Resolving Possible Matches
33. Compare Screen
34. Adding a new record – Michel Calame
35. Adding a new record
36. Adding your source to an existing record
37. Atom Pub API (Machine to machine)
• Requests and replacements (you can replace your existing data citing local identifier)
• Request
• Atom Pub Header
• Content = Request in the ISNI XML Request schema
• Documentation
• ISNI Atom Pub API guidelines.doc
• ISNI request.xsd (XML schema)
• ISNI request schema.doc (describes the schema)
• ISNI response.xsd (XML schema)
• ISNI response schema.doc (describes the schema)
38. WinIBW – QT Tool
• Sees whole records
• Can edit and delete all data
• Can force merge
• Macros with VB scripts
• Download records or selected fields
• E.g. identify a set of records for re-import
39. Hunting anomalies
• Strange source combinations
• Lived > 120 years; published before the age of 10
• Mismatching main names
• Browse index and Hitrange
• DOB 1900-
• Theses & dead < 1950
• Matching failures, e.g. TEL, Bowker, VIAF
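Two of the checks listed above are simple date arithmetic and can be sketched as follows. This is an illustration only: the function and parameter names are hypothetical, and it reads the second check as "first publication less than 10 years after birth", which is an interpretation of the slide's wording.

```python
from typing import List, Optional


def find_date_anomalies(birth_year: Optional[int],
                        death_year: Optional[int],
                        first_publication_year: Optional[int]) -> List[str]:
    """Flag records whose dates suggest mixed identities or bad data."""
    anomalies = []
    if birth_year is not None and death_year is not None:
        if death_year - birth_year > 120:
            anomalies.append("lived more than 120 years")
    if birth_year is not None and first_publication_year is not None:
        if first_publication_year - birth_year < 10:
            anomalies.append("published before the age of 10")
    return anomalies


print(find_date_anomalies(1800, 1950, 1805))
print(find_date_anomalies(1950, None, 1980))
```

Records flagged this way are candidates for review rather than automatic correction, since an unusual date may still be genuine.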
40. Utilities
• Re-import a set of records
• Delete source from a set of records
• Pseudonym fix
• Move from name variant to related name
• Link related name to other record
• Generate other record if it doesn’t exist
41. End User Note
Dear Sir / Madam, The ISNI 0000000117488848 refers to "Marco Antonio Casanova", Professor at the Catholic University of Rio de Janeiro. I am not the author of "Fragmentos póstumos. - Nietzsche uma introdução filosófica" or "Segunda consideração intempestiva da utilidade e desvantagem da história para a vida". The author of these works is "Marco Antonio dos Santos Casa Nova". You may confirm this information by consulting our CVs at the Brazilian Research Council:
Marco Antonio Casanova (me): http://lattes.cnpq.br/0400232298849115
Marco Antonio dos Santos Casa Nova (the other author): http://lattes.cnpq.br/3409704326617178
42. Reports and Notifications
• Bulk reports
• Basic
• Enriched
• Notifications
• Ad hoc reports
• Report generator
• WinIBW download
• Statistics
See document ISNI Data contributors reports and notifications guidelines.doc
http://www.isni.org/content/documents-related-data-submission-output
43. Statistics
Basic statistics
Cross matches
VIAF matches
Editor's Notes
Multiple domains
Updating
Batch load
VIAF updates
Web cat
AtomPub Request API
WinIBW
High end software – only used by union catalogues and national libraries. 2 of the installations are hosted in the OCLC Leiden office
Software users are all partners; they have the source code and because at the beginning neither the Dutch nor the German systems were using the same data definitions, the system is highly configurable.
Runs on Unix and Linux; re-engineered twice – for Unix and for Unicode
Mostly in C, Sybase – limited use for the bibliographic data
We have full confidence that the system is suitable for the scale of ISNI
Scissors: published by the Free Software Foundation; (Wikimedia commons)
Scaffolding: Tamal Das (Wikimedia commons)
By using the existing CBS system that is highly configurable, we were able to achieve
The slide shows the monitoring that is done by the OCLC Leiden Data centre of the Dutch union catalogue. The system is regularly receiving more than 50 requests per second.
ISNI has come nowhere near this level of traffic.
The data definition of ISNI is MARC based, using VIAF variations plus some new additions. The ISNI document “data element values” defines the valid values for certain elements. Here we are showing Name use as an example. These definitions will be made available as an ontology for all to use in linked data.
The blue box indicates the location of documentation on enquiry of the database. The indexes available in the members view are extensive, permitting multiple views of the data. Examples are in the pink box – e.g. combining a source code with update date, or with another source.
The system also includes a browse capability on most indexes. On the left is the name index and on the right a more unusual index, which finds records by number of sources.
Machine to machine enquiry. This is available in both private and public mode
Showing some of the XML schema that is sent in an SRU response
Showing some of the XML schema that is sent in an SRU response
ISNI members see more detail than the public; they see all but the private data
Though there is private data, each piece of data may appear multiple times with a source code, thus the same information may be public from one source and private from another. In effect the private data sits behind the public data.
Only 270,000 assigned records do not have a public source (mostly unique name). The majority of ISNI data is public. Even those records where there is private data, the core metadata is shown.
ISNI supports a persistent URL in the form isni.org/isni/<ISNI>. If there is a trailing space on the end then a content negotiation page is presented.
These are the ontologies used in the ISNI data returned. The ISNI ontology is necessary, in particular for the concept “public identity”
The resources tab on the isni.org web page – list of documents related to enquiry
Living online database. ISNI’s system has a focus on quality. The data contributors load data and are responsible for the quality of their own data. OCLC as the assignment agency
For batch loading, ISNI assignment is not guaranteed.
Criteria for assignment: 2 or more independent sources or 3 VIAF sources, or the name is unique. Unique name assignment requires the forenames to be complete (i.e. not initials), and the metadata to not be sparse. Some sources whose data is trusted to be fully differentiated and deduplicated are used as base data and single source assignment.
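The batch assignment criteria described above can be sketched as a single predicate. This is an illustration under stated assumptions, not the production rules: the function and parameter names are hypothetical, VIAF is counted as one independent source alongside non-VIAF sources, and the trusted sources are represented by the codes JNAM, BOEK and RING (JNAM standing for the Jisc Names Project).

```python
from typing import Set

# Assumption: source codes standing for the trusted single sources (JISC, BOEK, RING).
TRUSTED_SINGLE_SOURCES = {"JNAM", "BOEK", "RING"}


def qualifies_for_assignment(sources: Set[str],
                             viaf_library_count: int = 0,
                             name_is_unique: bool = False,
                             forenames_complete: bool = False,
                             metadata_is_sparse: bool = True) -> bool:
    """Apply the batch-load assignment criteria from the note above."""
    if len(sources) >= 2:
        return True   # two or more independent sources
    if "VIAF" in sources and viaf_library_count >= 3:
        return True   # three or more libraries within the VIAF cluster
    if name_is_unique and forenames_complete and not metadata_is_sparse:
        return True   # unique name assignment
    if len(sources) == 1 and sources <= TRUSTED_SINGLE_SOURCES:
        return True   # trusted single source
    return False


print(qualifies_for_assignment({"VIAF", "PROQ"}))
print(qualifies_for_assignment({"VIAF"}, viaf_library_count=2))
print(qualifies_for_assignment({"RING"}))
```

A record that meets none of the criteria stays provisional until later loads, re-match runs or manual work supply more evidence.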
For online assignment applications ISNI assignment is assured providing that the data passes the sparseness test and an assertion is made that the database has already been searched.
ISNI errs on the side of making duplicates rather than mixed identities. It is particularly careful in batch loading, more trusting in online assignment
there is a match with a record already on the database at a sufficient confidence level as determined by the software.
If there are multiple definite matches, the best candidate is chosen and the record is merged with it
The source profile can indicate that all non matching and non ambiguous records should be assigned
Unique name assignment (the name string is personal and unique, contains at least one full forename and the record is not sparse)
The system offers a configurable rules based matching capability. It is possible to configure new rules as we invent them then to also configure the way the central judge acts. Each rule returns a result independently, like an ice skating competition.
As a complement and alternative to rules based matching we are experimenting with matching article data from Article First with similarity vectors looking at the data as a blob. To make an analogy, rules based matching looks at the data as a salad with each component clearly visible whereas vector matching looks at it as a soup and produces a probability score. We will take only the highly confident data and load it to ISNI, matching with existing ISNI data in the traditional way.
Each month a new file is issued by VIAF. It is always a complete file. We compare the new file with the last one loaded. A delete is determined if the VIAF id is in the old file but not the new and an insert if the VIAF id is in the new file but not the old. We also detect the clusters that have a different mix of sources and the clusters where only the contents have changed. We load to ISNI in that sequence. The processing of deletes is complicated – firstly we remove all fields marked with VIAF source from the record. We only delete the record IF VIAF was the only source and IF the record was not assigned. If VIAF was the only source and the record is assigned, then we change the source to ISNI. We add a VIAF delete field to the record then rematch it at the end of the load.
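The monthly comparison and the delete handling described above can be sketched as follows. This is a simplified illustration: the `Cluster` type, the function names and the returned action labels are hypothetical, and each VIAF file is modelled as a dictionary keyed by VIAF id.

```python
from typing import Dict, FrozenSet, NamedTuple, Set


class Cluster(NamedTuple):
    sources: FrozenSet[str]  # contributing sources in the VIAF cluster
    content: str             # stand-in for the cluster's record data


def diff_viaf_files(old: Dict[str, Cluster], new: Dict[str, Cluster]) -> Dict[str, Set[str]]:
    """Classify VIAF ids by comparing the previous full file with the new one."""
    old_ids, new_ids = set(old), set(new)
    common = old_ids & new_ids
    sources_changed = {i for i in common if old[i].sources != new[i].sources}
    # "Contents changed" covers clusters where only the contents differ.
    contents_changed = {i for i in common - sources_changed
                        if old[i].content != new[i].content}
    return {
        "deletes": old_ids - new_ids,
        "additions": new_ids - old_ids,
        "sources_changed": sources_changed,
        "contents_changed": contents_changed,
    }


def process_viaf_delete(record_sources: Set[str], assigned: bool) -> str:
    """Decide what happens to an ISNI record whose VIAF cluster was deleted."""
    remaining = record_sources - {"VIAF"}  # VIAF-sourced fields are always removed
    if remaining:
        return "keep other sources and rematch"
    if assigned:
        return "change source to ISNI"     # an assigned ISNI must persist
    return "delete record"


old = {"1": Cluster(frozenset({"LC"}), "a"), "2": Cluster(frozenset({"LC", "BnF"}), "b"),
       "3": Cluster(frozenset({"DNB"}), "c")}
new = {"2": Cluster(frozenset({"LC"}), "b"), "3": Cluster(frozenset({"DNB"}), "c2"),
       "4": Cluster(frozenset({"BnF"}), "d")}
print(diff_viaf_files(old, new))
print(process_viaf_delete({"VIAF"}, assigned=True))
```

Keeping an assigned record under the ISNI source rather than deleting it is what preserves persistence of the identifier when VIAF withdraws a cluster.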
Re-importing is like taking the record out of the database with a rubber band. It only pretends to leave the database.
We also make cluster movement reports for LC and BnF, showing where their IDs have moved in and out of VIAF clusters. We examine the Out lines and sample them to see which ones are due to VIAF cluster merges and which ones are simply cluster changes. VIAF will not admit two local IDs from the same source into a cluster and this tends to make duplicate clusters.
Only available for Registration Agencies
This is a typical input from an end user of the ISNI database. The requests are coming in at an average of 2-3 a day. The requests are almost all of very high quality, as per the example above, and most (to our surprise) include an email address so that we can respond with the action taken. ISNI also undertakes to notify all sources when an error is fixed.