
(Big) bibliographic data @ ScaDS project meeting - 2015-06-12



Presentation at Big Data Competence Centre Dresden/Leipzig (ScaDS)

Published in: Education


  1. 1. (Big) Bibliographic Data UB Leipzig & SLUB Dresden ScaDS project meeting, 12.6.2015 Leander Seige, Felix Lohmeier, Ralf Talkenberger
  2. 2. “The library of the 21st century is a data hub.” quoted from an internal strategic paper of Leipzig University Library, 2015
  3. 3. simple bibliographic metadata <metadata> title author isbn publisher year … <resource> books serials newspapers articles ...
  4. 4. <resource> book ● printed books on the library’s shelves ● bought ebooks ● licensed ebooks ● pay-per-use ebooks ● free content ● ebooks to be bought by the library (patron-driven acquisition = PDA) ● even printed books to be bought by the library (also via PDA)
  5. 5. <resource> journals ● printed journals on the library’s shelves ● many more licensed electronic journals ○ full text accessible via web interfaces ● do we have article metadata? ● yes: licensed journal articles: tens of millions per library
  6. 6. <metadata> accessibility information ● where is a resource? (physical or on the net) ● who is allowed to access this content? (students? faculty? everyone?) ● is it available off-campus? ● did we buy it or is it just licensed? ● may the user copy or print it? ● is the library allowed to store the electronic file? ● may we grant access from wifi connections? ● ...or any combination of these...
  7. 7. <metadata> knowledge bases ● librarians built large knowledge bases to describe resources ● in German-speaking countries: GND (Gemeinsame Normdatei, the Integrated Authority File) of the German National Library ● international: ● provide DBpedia links to explore the linked data cloud and to enrich library data
  8. 8. <metadata> knowledge bases ● GND (and other national authority files via VIAF) ○ describe Persons, Corporate bodies, Conferences and Events, Geographic Information, Topics, Works and relationships between them ○ form a generic knowledge base, independent from any specific domain ○ provide links to other knowledge bases (dbpedia, geonames...)
  9. 9. resource discovery ● traditional “OPACs” provided access to traditional library resources like printed books; users had to use proprietary, vendor-driven portals to access electronic resources ● today, printed materials represent only a small part of library resources ● in contrast: resource discovery systems aim to integrate all resources of a library and present them in one single search interface
  10. 10. Cooperation ● UBL and SLUB joined forces in March 2015 ● Goals: a. Exchange of metadata after processing b. Develop common workflows to avoid “double work” → integrate existing tools finc & d:swarm
  11. 11. finc Community ● maintains a large search engine infrastructure ● developed and hosted at Leipzig University Library ● based on Apache Solr and VuFind ● robust metadata management system, processing millions of data records each day ● integrates more than 50 data sources
  12. 12. finc Community ● provides more than 15 university libraries with resource discovery systems ● offers great potential to design and implement user-oriented functions on real-world systems, serving thousands of library users in Saxony and beyond every day ● employs the aggregated index at Leipzig University Library
  13. 13. 10% physical items 90% electronic content on the net aggregated index at Leipzig University Library
  14. 14. aggregated index at Leipzig University Library ● 12 million traditional data records (growing) ● 80 million electronic article data records (growing) ● each record contains 20 data fields ● about 1.8 billion triples if triplified (without any enrichment data)
  15. 15. Data processing today ● distributed data storage ○ 2 Solr instances in Leipzig (~12 million + ~80 million records) ○ 2 Solr instances in Dresden (~2 million + ~2 million records) ● constraint: each data source is handled separately → difficult to build up relations and deep data integration
  16. 16. d:swarm ● yet another tool…? a. property graph database b. gui for library staff
  17. 17. Tools: finc vs. d:swarm ● focus ○ finc: data normalization ○ d:swarm: data integration and enrichment ● technology ○ finc: script-based transformations (Python, Go, ElasticSearch) ○ d:swarm: encapsulates Metafacture (open source toolchain for metadata transformation), property graph (Neo4j) ● status ○ finc: works fine with ~100 million records (less than one day) ○ d:swarm: scalability issues (~4 million records in less than one day)
  18. 18. integrating finc with d:swarm ● enhance data processing regarding ○ authority data linking (NLP) ○ fuzzy deduplication ○ classification ○ relate bibliographic data to places, topics, abstract terms ○ publish machine readable data (linked data) ● create user interfaces to enable system librarians to control metadata processing
  19. 19. Tomorrow: common workflows ● All data flows through both tools (finc + d:swarm) ● Deduplication (duplicates are easier to detect in a graph database) ● FRBRization (aggregate different physical and formal versions of a work) ● Knowledge graph makes enrichment (authorities, altmetrics data, usage data, …) and analytics easier
  20. 20. Scalability issues ● current implementation of the property graph is too slow ● test results with 64 GB RAM, SSD, 16 cores ○ 1.2 million records (flat format): 10 hours for the complete workflow (ingest, transformation, export) ○ more complex formats (MARC21) produce up to 5x as many statements ● single Neo4j instance, storage and memory issues
  21. 21. d:swarm architecture
  22. 22. Possible solutions? ● “throw hardware at it” (German: “mit Hardware erschlagen”) ● Another graphDB, parallelization? ○ ArangoDB: ○ Apache Giraph: ○ Blaze Graph: (Wikidata’s choice) ● Gradoop?!
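
The triple count on slide 14 follows from simple arithmetic: if every data field of a record becomes one (subject, predicate, object) statement, then 12 million plus 80 million records with 20 fields each give roughly 1.8 billion triples. The sketch below illustrates that reading. The function names, the record identifier scheme and the sample record are hypothetical, and the one-triple-per-field mapping is an assumption, not the actual finc or d:swarm data model.

```python
from typing import Dict, Iterator, Tuple

Triple = Tuple[str, str, str]


def triplify(record_id: str, fields: Dict[str, str]) -> Iterator[Triple]:
    """Yield one (subject, predicate, object) triple per non-empty field.

    Assumption: a flat record maps to exactly one triple per field.
    """
    for name, value in fields.items():
        if value:
            yield (record_id, name, value)


def estimate_triples(record_count: int, fields_per_record: int) -> int:
    """Estimate the number of triples for a uniformly sized record set."""
    return record_count * fields_per_record


# Hypothetical sample record using field names from slide 3.
record = {"title": "Example Title", "author": "Example Author", "year": "2015"}
triples = list(triplify("rec:1", record))

# Figures from slide 14: 12 million traditional + 80 million article records,
# 20 data fields each.
total = estimate_triples(12_000_000 + 80_000_000, 20)
print(len(triples), total)
```

The estimate ignores enrichment data and empty fields, so the real number of statements could be higher or lower; slide 20 notes that more complex formats such as MARC21 produce several times more statements per record.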