4. <resource> book
● printed books in the library’s shelves
● bought ebooks
● licensed ebooks
● pay-per-use ebooks
● free content
● ebooks to be bought by the library (patron-driven acquisition = PDA)
● even printed books to be bought by the library (PDA as well)
5. <resource> journals
● printed journals in the library’s shelves
● far more licensed electronic journals
○ full text accessible via web interfaces
● do we have article metadata?
● yes: licensed journal articles, tens of millions per library
6. <metadata> accessibility information
● where is a resource? (physically or on the net)
● who is allowed to access this content? (students? faculty? everyone?)
● is it available off-campus?
● did we buy it or is it just licensed?
● may the user copy or print it?
● is the library allowed to store the electronic file?
● may we grant access from wifi connections?
● ...or any combination of these...
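The combinations listed above can be modelled as one access-rights record per resource. A minimal sketch, assuming hypothetical field and class names (this is not an actual library schema):

```python
from dataclasses import dataclass

# Hypothetical access-metadata record; field names are assumptions,
# not the real finc/d:swarm schema.
@dataclass
class AccessRights:
    location: str          # shelf mark or URL
    allowed_groups: set    # e.g. {"students", "faculty"} or {"everyone"}
    off_campus: bool       # reachable outside the campus network?
    owned: bool            # bought (True) or only licensed (False)?
    may_copy: bool         # copying/printing permitted?
    may_archive: bool      # may the library store the electronic file?
    wifi_access: bool      # access from wifi connections?

    def accessible_by(self, group: str, on_campus: bool) -> bool:
        """Check group membership first, then the location constraint."""
        if group not in self.allowed_groups and "everyone" not in self.allowed_groups:
            return False
        return on_campus or self.off_campus

# Example: a licensed ebook, off-campus access allowed for members only.
ebook = AccessRights("https://example.org/ebook/42", {"students", "faculty"},
                     off_campus=True, owned=False, may_copy=False,
                     may_archive=True, wifi_access=True)
```

Because any combination of these flags can apply, a flat record of booleans like this is easier to query than free-text license notes.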
7. <metadata> knowledge bases
● librarians built large knowledge bases to describe resources
● in German-speaking countries: the GND (Gemeinsame Normdatei) of the
German National Library http://www.dnb.de/EN/gnd
● international: http://viaf.org
● provide DBpedia links to explore the linked data cloud and to enrich
library data
8. <metadata> knowledge bases
● GND (and other national authority files via VIAF)
○ describe persons, corporate bodies, conferences and events,
geographic information, topics, works, and the relationships
between them
○ form a generic knowledge base, independent of any specific
domain
○ provide links to other knowledge bases (DBpedia, GeoNames, ...)
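The entity-and-relationship structure described above can be sketched as a tiny in-memory graph. The identifiers and relations below are purely illustrative, not real GND data:

```python
# Toy graph in the spirit of GND: typed entities plus relations
# between them. IDs, labels, and predicates are made up for
# illustration only.
entities = {
    "person:goethe": {"type": "Person", "label": "Johann Wolfgang von Goethe"},
    "work:faust":    {"type": "Work", "label": "Faust"},
    "place:weimar":  {"type": "Geographic", "label": "Weimar"},
}

relations = [
    ("work:faust", "createdBy", "person:goethe"),
    ("person:goethe", "placeOfActivity", "place:weimar"),
]

def related(entity_id):
    """Return all entities linked to entity_id, in either direction."""
    out = []
    for s, p, o in relations:
        if s == entity_id:
            out.append((p, o))
        elif o == entity_id:
            out.append((p, s))
    return out
```

Traversing such links (and following the outbound links to DBpedia or GeoNames) is exactly the kind of exploration a generic knowledge base enables.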
9. resource discovery
● traditional “OPACs” provided access to traditional library resources like
printed books; users had to use proprietary, vendor-driven portals to
access electronic resources
● today, printed materials represent only a small part of library resources
● in contrast: resource discovery systems aim to integrate all
resources of a library and present them in one single search
interface
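The core idea of resource discovery on the slide above can be sketched as one query fanned out over several source-specific indexes and merged into a single result list. The data and matching below are toy assumptions, not the finc implementation:

```python
# Two toy "indexes" for different resource types; in a real discovery
# system these would be Solr cores, not Python lists.
print_catalog = [
    {"title": "Linked Data for Libraries", "medium": "print"},
]
ebook_index = [
    {"title": "Linked Data Basics", "medium": "ebook"},
    {"title": "Metadata Management", "medium": "ebook"},
]

def discover(query, *indexes):
    """One search interface over all resource types: query each
    index and merge the hits into a single result list."""
    hits = []
    for index in indexes:
        hits.extend(r for r in index if query.lower() in r["title"].lower())
    return hits

results = discover("linked data", print_catalog, ebook_index)
```

The user sees one ranked list regardless of whether a hit is a printed book on a shelf or a licensed ebook.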
10. Cooperation
● UBL and SLUB joined forces in March 2015
● Goals:
a. Exchange of metadata after processing
b. Develop common workflows to avoid “double work”
→ integrate existing tools finc & d:swarm
11. finc Community
● maintains a large search engine infrastructure
● developed and hosted at Leipzig University Library
● based on Apache Solr and VuFind
● robust metadata management system,
processing millions of data records each day
● integrates more than 50 data sources
https://finc.info
12. finc Community
● provides more than 15 university libraries with
resource discovery systems
● offers great potential to design and implement user-oriented
functions on real-world systems, serving thousands of library
users in Saxony and beyond, every day
● employs the aggregated index at Leipzig University Library
https://finc.info
14. aggregated index at
Leipzig University Library
● 12 million traditional data records (growing)
● 80 million electronic article data records (growing)
● each record contains 20 data fields
≈ 1.8 billion triples
(if you triplify it)
(without any enrichment data)
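The triple count follows directly from the figures above: one triple per data field per record, with no enrichment data. A quick back-of-the-envelope check:

```python
# Numbers from the slide: 12 million traditional records plus
# 80 million article records, 20 data fields each, one triple
# per field, no enrichment.
traditional = 12_000_000
articles = 80_000_000
fields_per_record = 20

triples = (traditional + articles) * fields_per_record
print(triples)  # 1840000000, i.e. roughly 1.8 billion
```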
15. Data processing today
● distributed data storage
○ 2 Solr instances in Leipzig
(~12 million + ~80 million records)
○ 2 Solr instances in Dresden
(~2 million + ~2 million records)
● constraint: each data source is
handled separately
→ difficult to build up relations
and achieve deep data integration
17. Tools
● finc
○ focus: data normalization
○ technology: script-based transformations
(Python, Go, Elasticsearch)
○ status: works fine with ~100 million records
(in less than one day)
● d:swarm
○ focus: data integration and enrichment
○ technology: encapsulates Metafacture (open
source toolchain for metadata
transformation); property graph (Neo4j)
○ status: scalability issues (~4 million records in
less than one day)
18. integrating finc with d:swarm
● enhance data processing regarding
○ authority data linking (NLP)
○ fuzzy deduplication
○ classification
○ relate bibliographic data to places, topics, and abstract terms
○ publish machine-readable data (linked data)
● create user interfaces to enable system librarians to control metadata
processing
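Fuzzy deduplication, one of the enhancements listed above, can be sketched with stdlib string similarity. The threshold and records below are illustrative; d:swarm's actual matching runs on a property graph:

```python
from difflib import SequenceMatcher

# Hypothetical similarity test on record titles; the 0.85 threshold
# is an assumption for this sketch.
def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def dedupe(records):
    """Keep the first record of each fuzzy-duplicate cluster."""
    kept = []
    for rec in records:
        if not any(similar(rec["title"], k["title"]) for k in kept):
            kept.append(rec)
    return kept

records = [
    {"title": "Introduction to Metadata"},
    {"title": "Introduction to Metadata."},   # near-duplicate
    {"title": "Resource Discovery Systems"},
]
```

Pairwise comparison like this is quadratic in the number of records, which is why clustering candidates in a graph first (as on the next slide) matters at library scale.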
19. Tomorrow: common workflows
● All data flows through both tools (finc + d:swarm)
● Deduplication (duplicate recognition is easier in a graph DB)
● FRBRization (aggregate different physical and formal versions of a
work)
● Knowledge graph makes enrichment (authorities, altmetrics data,
usage data, …) and analytics easier
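FRBRization, aggregating the different physical and formal versions of a work, can be sketched by clustering records under a shared work key. The normalization below (author plus title) is a naive assumption for illustration, not the real FRBR algorithm:

```python
from collections import defaultdict

# Naive work key: normalized author + title. Real FRBRization
# also handles variant titles, translations, etc.
def work_key(rec):
    return (rec["author"].lower(), rec["title"].lower())

def frbrize(records):
    """Group the formats of each work under one key."""
    works = defaultdict(list)
    for rec in records:
        works[work_key(rec)].append(rec["format"])
    return dict(works)

records = [
    {"author": "Goethe", "title": "Faust", "format": "print"},
    {"author": "Goethe", "title": "Faust", "format": "ebook"},
    {"author": "Goethe", "title": "Faust", "format": "audiobook"},
]
```

A discovery interface can then show one entry per work with all available formats, instead of three seemingly unrelated hits.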
20. Scalability issues
● current implementation of the property graph is too slow
● test results with 64 GB RAM, SSD, 16 cores:
○ 1.2 million records (flat format): 10 hours for the complete workflow
(ingest, transformation, export)
○ more complex formats (MARC21) produce up to 5× as many statements
● single Neo4j instance; storage and memory issues