Part two of the workshop given by Rene Voorburg and Juliette Lonij at the launch event of the new website of the KB Lab on 11 April 2017. Part one of the workshop is available here; https://www.slideshare.net/KBNLResearch/rene-voorburg-and-juliette-lonij-search-download-and-analyse-kb-data-sets-a-practical-introduction
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Rene Voorburg - Using KB APIs to collect data
1. Using KB APIs to collect data
KB websites like Delpher.nl or GeheugenvanNederland.nl use the KBGA
infrastructure to search, retrieve and display digital content.
This infrastructure is also available to Digital Humanities scholars.
KBGA can be described as a set of datatypes and APIs, providing
dataservices.
2. Example dataset: ANP Radiobulletins
• About 1.5 million digitized typoscripts from radio news broadcasts between
1937 and 1984
• Available through the Delpher website as Radiobulletins collection
• KB offers the data under (semi) open licenses:
• CC0-license for the metadata, CC-BY-NC-ND-licenses for images and
full-text objects
3. Datatypes in the KBGA Infrastructure
• Structure metadata in MPEG21 DIDL format
4. Datatypes in the KBGA Infrastructure
• Structure metadata in MPEG21 DIDL format
5. Datatypes in the KBGA Infrastructure
• Structure metadata in MPEG21 DIDL format
6. Datatypes in the KBGA Infrastructure
• Structure metadata in MPEG21 DIDL format
• Descriptive metadata in the Dublin Core (DC) vocabulary,
with some KB-specific extensions (‘DCX’)
• ALTO files with content and layout information
• OCR files containing only the textual content of the item
• One or more images of the object, for example in JPEG format
7. KBGA APIs
KBGA consists of three main functional components / groups of APIs:
1. The metadata associated with objects (descriptive, structural) is stored in a
OAI-PMH compatible metadata storage (KB-MDO).
2. The digital objects or files referred to by the structure metadata (images,
ALTO, OCR) are stored in a central web storage (KB-WO) accessible
through resolver links.
3. In order to search, the KB provides a search index from metadata and full-
text objects (jSRU).
9. 1. KB-MDO - OAI-PMH repository
• Based on the OAI-PMH (Open Archives Initiative Protocol for Metadata
Harvesting) protocol
• Harvesting metadata of a single record or an entire set
• Administrative information to keep copy up to date.
10. KB-MDO OAI-PMH examples 1/2
• http://services.kb.nl/mdo/oai?verb=ListSets
ListSets: Provides an overview of all 'sets'.
• http://services.kb.nl/mdo/oai?verb=ListIdentifiers&set=anp&metadataPrefix=didl
ListIdentifiers: Gives the first 'batch' of MPEG21 DIDL records of the ANP set.
• http://services.kb.nl/mdo/oai?
verb=ListIdentifiers&set=anp&metadataPrefix=didl&from=2008-10-01
Similar to the previous request, but now restricted to records that were last updated
on 2008-10-01 or later.
11. KB-MDO OAI-PMH examples 2/2
• http://services.kb.nl/mdo/oai?verb=ListIdentifiers&resumptionToken=anp!
2008-10-03T11:11:50.639Z!!didl!3932926
Returns the next 'batch' of identifiers for the previous request.
• http://services.kb.nl/mdo/oai?verb=GetRecord&identifier=anp:anp:
1937:07:08:8:mpeg21&metadataPrefix=didl
GetRecord: Retrieves one specific MPEG21 DIDL record.
12. Tip: Harvesting KB-MDO
A simple tool for harvesting metadata from KB-MDO (or any other OAI-PMH
repository) is oai2linerec (https://github.com/renevoorburg/oai2linerec)
./oai2linerec.sh -s anp -p didl -b http://services.kb.nl/mdo/oai -o output.txt
14. 2. Using resolver links - ANP example
• Persistent identifier in the form of a resolver link:
http://resolver.kb.nl/resolve?urn=anp:1973:10:18:44:mpeg21
• Image can be retrieved by adding the :image suffix:
http://resolver.kb.nl/resolve?urn=anp:1973:10:18:44:mpeg21:image
• OCR can be retrieved by adding the :ocr suffix:
http://resolver.kb.nl/resolve?urn=anp:1973:10:18:44:mpeg21:ocr
• ALTO file can be retrieved using the :alto suffix:
http://resolver.kb.nl/resolve?urn=anp:1973:10:18:44:mpeg21:alto
16. 3. jSRU for search
• If you do not wish to retrieve an entire set, but are more interested in getting
results for a particular search query
• Based on the SRU (Search and Retrieval via URL) standard protocol
• Uses CQL (Contextual Query Language) as its query language
17. Using jSRU 1/3 - basic example
http://jsru.kb.nl/sru/sru?operation=searchRetrieve&x-
collection=ANP&query=nobelprijs
Base URL followed by query string with a number of parameters:
• Parameter operation with value searchRetrieve to specify the operation
• Parameter x-collection with value ANP to select the collection to search
• Parameter query to enter the search query ‘nobelprijs’
18. Using jSRU 2/3 - more parameters
• With the operation parameter set to explain, some general information
about the service can be obtained
• Adjust the number of results returned with the maximumRecords
parameter
• Use the startRecord parameter if you want to start viewing the result set
from a particular record onwards
19. Using jSRU 3/3 - CQL
• Keyword string: query="nobelprijs literatuur"
• OR- or AND-queries: query=nobelprijs AND literatuur
• Wildcards: query=nobelpr*
• Restrict to a particular field: query=date=01-01-1960
• Restrict to period in time: query=date within "01-01-1960 01-01-1961"
20. Exercise - Collect ANP data
To play with the Frame generator, collect OCR files from the ANP-collection.
1. Define a basic jSRU search query that defines a matching subset for your
research.
2. Perform this query for various periods in time.
3. Use the resolver to get the OCR (~ 3 per period)
Use ‘Save as…’ in your browser.
Apply the .xml-extension.
See http://lab.kb.nl/workshop-using-frame-generator