4. <resource> book
● printed books in the library’s shelves
● bought ebooks
● licensed ebooks
● pay-per-use ebooks
● free content
● ebooks to be bought by the library (patron-driven acquisition = PDA)
● even printed books to be bought by the library (PDA as well)
5. <resource> journals
● printed journals in the library’s shelves
● far more licensed electronic journals
○ full text accessible via web interfaces
● do we have article metadata?
● yes: licensed journal articles, tens of millions per library
6. <metadata> accessibility information
● where is a resource? (physically or on the net)
● who is allowed to access this content? (students? faculty? everyone?)
● is it available off-campus?
● did we buy it or is it just licensed?
● may the user copy or print it?
● is the library allowed to store the electronic file?
● may we grant access from wifi connections?
● ...or any combination of these...
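The combinations listed above can be modelled as one access-rights record per resource. A minimal sketch, assuming hypothetical field and class names (this is not an actual library schema):

```python
from dataclasses import dataclass

# Hypothetical access-metadata record; field names are assumptions,
# not the real finc/d:swarm schema.
@dataclass
class AccessRights:
    location: str          # shelf mark or URL
    allowed_groups: set    # e.g. {"students", "faculty"} or {"everyone"}
    off_campus: bool       # reachable outside the campus network?
    owned: bool            # bought (True) or only licensed (False)?
    may_copy: bool         # copying/printing permitted?
    may_archive: bool      # may the library store the electronic file?
    wifi_access: bool      # access from wifi connections?

    def accessible_by(self, group: str, on_campus: bool) -> bool:
        """Check group membership first, then the location constraint."""
        if group not in self.allowed_groups and "everyone" not in self.allowed_groups:
            return False
        return on_campus or self.off_campus

# Example: a licensed ebook, off-campus access allowed for members only.
ebook = AccessRights("https://example.org/ebook/42", {"students", "faculty"},
                     off_campus=True, owned=False, may_copy=False,
                     may_archive=True, wifi_access=True)
```

Because any combination of these flags can apply, a flat record of booleans like this is easier to query than free-text license notes.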
7. <metadata> knowledge bases
● librarians built large knowledge bases to describe resources
● in German-speaking countries: the GND (Gemeinsame Normdatei) of the
German National Library http://www.dnb.de/EN/gnd
● international: http://viaf.org
● provide DBpedia links to explore the linked data cloud and to enrich
library data
8. <metadata> knowledge bases
● GND (and other national authority files via VIAF)
○ describe persons, corporate bodies, conferences and events,
geographic information, topics, works, and the relationships
between them
○ form a generic knowledge base, independent of any specific
domain
○ provide links to other knowledge bases (DBpedia, GeoNames, ...)
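The entity-and-relationship structure described above can be sketched as a tiny in-memory graph. The identifiers and relations below are purely illustrative, not real GND data:

```python
# Toy graph in the spirit of GND: typed entities plus relations
# between them. IDs, labels, and predicates are made up for
# illustration only.
entities = {
    "person:goethe": {"type": "Person", "label": "Johann Wolfgang von Goethe"},
    "work:faust":    {"type": "Work", "label": "Faust"},
    "place:weimar":  {"type": "Geographic", "label": "Weimar"},
}

relations = [
    ("work:faust", "createdBy", "person:goethe"),
    ("person:goethe", "placeOfActivity", "place:weimar"),
]

def related(entity_id):
    """Return all entities linked to entity_id, in either direction."""
    out = []
    for s, p, o in relations:
        if s == entity_id:
            out.append((p, o))
        elif o == entity_id:
            out.append((p, s))
    return out
```

Traversing such links (and following the outbound links to DBpedia or GeoNames) is exactly the kind of exploration a generic knowledge base enables.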
9. resource discovery
● traditional “OPACs” provided access to traditional library resources like
printed books; users had to use proprietary, vendor-driven portals to
access electronic resources
● today, printed materials represent only a small part of library resources
● in contrast: resource discovery systems aim to integrate all
resources of a library and present them in one single search
interface
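The core idea of resource discovery on the slide above can be sketched as one query fanned out over several source-specific indexes and merged into a single result list. The data and matching below are toy assumptions, not the finc implementation:

```python
# Two toy "indexes" for different resource types; in a real discovery
# system these would be Solr cores, not Python lists.
print_catalog = [
    {"title": "Linked Data for Libraries", "medium": "print"},
]
ebook_index = [
    {"title": "Linked Data Basics", "medium": "ebook"},
    {"title": "Metadata Management", "medium": "ebook"},
]

def discover(query, *indexes):
    """One search interface over all resource types: query each
    index and merge the hits into a single result list."""
    hits = []
    for index in indexes:
        hits.extend(r for r in index if query.lower() in r["title"].lower())
    return hits

results = discover("linked data", print_catalog, ebook_index)
```

The user sees one ranked list regardless of whether a hit is a printed book on a shelf or a licensed ebook.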
10. Cooperation
● UBL and SLUB joined forces in March 2015
● Goals:
a. Exchange of metadata after processing
b. Develop common workflows to avoid “double work”
→ integrate existing tools finc & d:swarm
11. finc Community
● maintains a large search engine infrastructure
● developed and hosted at Leipzig University Library
● based on Apache Solr and VuFind
● robust metadata management system,
processing millions of data records each day
● integrates more than 50 data sources
https://finc.info
12. finc Community
● provides more than 15 university libraries with
resource discovery systems
● offers great potential to design and implement user-oriented
functions on real-world systems, serving thousands of library
users in Saxony and beyond, every day
● employs the aggregated index at Leipzig University Library
https://finc.info
14. aggregated index at
Leipzig University Library
● 12 million traditional data records (growing)
● 80 million electronic article data records (growing)
● each record contains 20 data fields
≈ 1.8 billion triples
(if you triplify it)
(without any enrichment data)
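The triple count follows directly from the figures above: one triple per data field per record, with no enrichment data. A quick back-of-the-envelope check:

```python
# Numbers from the slide: 12 million traditional records plus
# 80 million article records, 20 data fields each, one triple
# per field, no enrichment.
traditional = 12_000_000
articles = 80_000_000
fields_per_record = 20

triples = (traditional + articles) * fields_per_record
print(triples)  # 1840000000, i.e. roughly 1.8 billion
```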
15. Data processing today
● distributed data storage
○ 2 Solr instances in Leipzig
(~12 million + ~80 million records)
○ 2 Solr instances in Dresden
(~2 million + ~2 million records)
● constraint: each data source is
handled separately
→ difficult to build up relations
and achieve deep data integration
17. Tools
● finc
○ focus: data normalization
○ technology: script-based transformations
(Python, Go, Elasticsearch)
○ status: works fine with ~100 million records
(in less than one day)
● d:swarm
○ focus: data integration and enrichment
○ technology: encapsulates Metafacture (open
source toolchain for metadata
transformation); property graph (Neo4j)
○ status: scalability issues (~4 million records in
less than one day)
18. integrating finc with d:swarm
● enhance data processing regarding
○ authority data linking (NLP)
○ fuzzy deduplication
○ classification
○ relate bibliographic data to places, topics, and abstract terms
○ publish machine-readable data (linked data)
● create user interfaces to enable system librarians to control metadata
processing
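Fuzzy deduplication, one of the enhancements listed above, can be sketched with stdlib string similarity. The threshold and records below are illustrative; d:swarm's actual matching runs on a property graph:

```python
from difflib import SequenceMatcher

# Hypothetical similarity test on record titles; the 0.85 threshold
# is an assumption for this sketch.
def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def dedupe(records):
    """Keep the first record of each fuzzy-duplicate cluster."""
    kept = []
    for rec in records:
        if not any(similar(rec["title"], k["title"]) for k in kept):
            kept.append(rec)
    return kept

records = [
    {"title": "Introduction to Metadata"},
    {"title": "Introduction to Metadata."},   # near-duplicate
    {"title": "Resource Discovery Systems"},
]
```

Pairwise comparison like this is quadratic in the number of records, which is why clustering candidates in a graph first (as on the next slide) matters at library scale.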
19. Tomorrow: common workflows
● All data flows through both tools (finc + d:swarm)
● Deduplication (duplicate recognition is easier in a graph DB)
● FRBRization (aggregate different physical and formal versions of a
work)
● Knowledge graph makes enrichment (authorities, altmetrics data,
usage data, …) and analytics easier
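FRBRization, aggregating the different physical and formal versions of a work, can be sketched by clustering records under a shared work key. The normalization below (author plus title) is a naive assumption for illustration, not the real FRBR algorithm:

```python
from collections import defaultdict

# Naive work key: normalized author + title. Real FRBRization
# also handles variant titles, translations, etc.
def work_key(rec):
    return (rec["author"].lower(), rec["title"].lower())

def frbrize(records):
    """Group the formats of each work under one key."""
    works = defaultdict(list)
    for rec in records:
        works[work_key(rec)].append(rec["format"])
    return dict(works)

records = [
    {"author": "Goethe", "title": "Faust", "format": "print"},
    {"author": "Goethe", "title": "Faust", "format": "ebook"},
    {"author": "Goethe", "title": "Faust", "format": "audiobook"},
]
```

A discovery interface can then show one entry per work with all available formats, instead of three seemingly unrelated hits.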
20. Scalability issues
● current implementation of the property graph is too slow
● test results with 64 GB RAM, SSD, 16 cores:
○ 1.2 million records (flat format): 10 hours for the complete workflow
(ingest, transformation, export)
○ more complex formats (MARC21) produce up to 5× as many statements
● single Neo4j instance; storage and memory issues