(Big) bibliographic data @ ScaDS project meeting - 2015-06-12

Felix Lohmeier
(Big) Bibliographic Data
UB Leipzig & SLUB Dresden
ScaDS project meeting, 12.6.2015
Leander Seige, Felix Lohmeier, Ralf Talkenberger
“The library of the
21st century
is a data hub.”
quoted from an internal strategic paper of
Leipzig University Library, 2015
simple bibliographic metadata
<metadata>
title
author
isbn
publisher
year
…
<resource>
books
serials
newspapers
articles
...
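A minimal sketch of how the metadata fields above could be modeled as a flat record. The class name `BibRecord`, the field types and the sample values are assumptions for illustration, not the schema used by finc or d:swarm.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class BibRecord:
    """Hypothetical flat record with the fields named on the slide."""
    title: str
    author: str
    publisher: str
    year: int
    resource_type: str          # e.g. "book", "serial", "newspaper", "article"
    isbn: Optional[str] = None  # many resource types carry no ISBN


# Made-up sample data, for illustration only.
record = BibRecord(
    title="An Example Title",
    author="Doe, Jane",
    publisher="Example Press",
    year=2015,
    resource_type="book",
    isbn="978-3-16-148410-0",
)

print(asdict(record))
```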
<resource> book
● printed books on the library’s shelves
● bought ebooks
● licensed ebooks
● pay-per-use ebooks
● free content
● ebooks to be bought by the library (patron driven acquisition = pda)
● even printed books to be bought by the library (pda too)
<resource> journals
● printed journals on the library’s shelves
● many more licensed electronic journals
○ full text accessible via web interfaces
● do we have article metadata?
● yes: licensed journal articles: 10s of millions per library
<metadata> accessibility information
● where is a resource? (physical or on the net)
● who is allowed to access this content? (students? faculty? everyone?)
● is it available off-campus?
● did we buy it or is it just licensed?
● may the user copy or print it?
● is the library allowed to store the electronic file?
● may we grant access from wifi connections?
● ...or any combination of these...
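Because these answers can occur in any combination, one way to hold them is a set of combinable flags. The sketch below uses Python's `enum.Flag`; the class name `Access` and the flag names are hypothetical and only mirror some of the questions above.

```python
from enum import Flag, auto


class Access(Flag):
    """Hypothetical permission flags, one per question on the slide."""
    OFF_CAMPUS = auto()
    OWNED = auto()          # bought rather than just licensed
    COPY_OR_PRINT = auto()
    LOCAL_STORAGE = auto()  # library may store the electronic file
    WIFI = auto()


# A licensed e-book that may be read off-campus and over wifi, nothing else.
licensed_ebook = Access.OFF_CAMPUS | Access.WIFI

print(Access.WIFI in licensed_ebook)           # True
print(Access.COPY_OR_PRINT in licensed_ebook)  # False
```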
<metadata> knowledge bases
● librarians built large knowledge bases to describe resources
● in German-speaking countries: GND (Gemeinsame Normdatei, the Integrated
Authority File) of the German National Library http://www.dnb.de/EN/gnd
● international: http://viaf.org
● provide DBpedia links to explore the linked data cloud and to enrich
library data
<metadata> knowledge bases
● GND (and other national authority files via VIAF)
○ describe Persons, Corporate bodies, Conferences and Events,
Geographic Information, Topics, Works and relationships
between them
○ form a generic knowledge base, independent from any specific
domain
○ provide links to other knowledge bases (dbpedia, geonames...)
resource discovery
● traditional “OPACs” provided access to traditional library resources like
printed books; users had to use proprietary vendor-driven portals to
access electronic resources
● today, printed materials represent only a small part of library resources
● in contrast: resource discovery systems aim to integrate all
resources of a library and present them in one single search
interface
Cooperation
● UBL and SLUB joined forces in March 2015
● Goals:
a. Exchange of metadata after processing
b. Develop common workflows to avoid “double work”
→ integrate existing tools finc & d:swarm
finc Community
● maintains a large search engine infrastructure
● developed and hosted at Leipzig University Library
● based on Apache Solr and VuFind
● robust metadata management system,
processing millions of data records each day
● integrates more than 50 data sources
https://finc.info
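To illustrate what searching a Solr-based index looks like, here is a sketch that only builds a query URL for Solr's standard `/select` handler. The host, the core name `biblio` and the field name `title` are assumptions and do not describe the actual finc setup; the term is not escaped, so this is not production code.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: a local Solr instance with a core named "biblio".
SOLR_SELECT = "http://localhost:8983/solr/biblio/select"


def build_query_url(term, rows=10):
    """Build a phrase query on a hypothetical "title" field."""
    params = {"q": f'title:"{term}"', "rows": rows, "wt": "json"}
    return f"{SOLR_SELECT}?{urlencode(params)}"


url = build_query_url("linked data")
print(url)
```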
finc Community
● provides more than 15 university libraries with
resource discovery systems
● offers great potential to design and implement user oriented
functions on real world systems, serving thousands of library
users in Saxony and beyond, every day
● employs the aggregated index at Leipzig University Library
https://finc.info
aggregated index at Leipzig University Library
● 10% physical items
● 90% electronic content on the net
aggregated index at
Leipzig University Library
● 12 million traditional data records (growing)
● 80 million electronic article data records (growing)
● each record contains 20 data fields
≈ 1.8 billion triples
(if you triplify it)
(without any enrichment data)
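The triple estimate follows from one triple per field per record. The sketch below shows that arithmetic along with a naive triplification of one flat record; the function name `triplify`, the record identifier and the sample fields are hypothetical.

```python
TRADITIONAL_RECORDS = 12_000_000   # from the slide
ARTICLE_RECORDS = 80_000_000       # from the slide
FIELDS_PER_RECORD = 20             # from the slide


def triplify(record_id, fields):
    """Turn one flat record into (subject, predicate, object) triples."""
    return [(record_id, name, value) for name, value in fields.items()]


sample = triplify("rec:1", {"title": "An Example Title", "year": "2015"})
print(sample)

total_triples = (TRADITIONAL_RECORDS + ARTICLE_RECORDS) * FIELDS_PER_RECORD
print(f"{total_triples:,}")  # 1,840,000,000
```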
Data processing today
● distributed data storage
○ 2 Solr in Leipzig
(~12 million + ~80 million records)
○ 2 Solr in Dresden
(~2 million + ~2 million records)
● constraint: each data source is
handled separately
→ difficult to build up relations
and deep data integration
d:swarm
● yet another tool…?
a. property graph database
b. gui for library staff
Tools
● finc
○ focus: data normalization
○ technology: script-based transformations (Python, Go, Elasticsearch)
○ status: works fine with ~100 million records (less than one day)
● d:swarm
○ focus: data integration and enrichment
○ technology: encapsulates Metafacture (open source toolchain for metadata
transformation), property graph (Neo4j)
○ status: scalability issues (~4 million records in less than one day)
integrating finc with d:swarm
● enhance data processing regarding
○ authority data linking (NLP)
○ fuzzy deduplication
○ classification
○ relate bibliographic data to places, topics, abstract terms
○ publish machine readable data (linked data)
● create user interfaces to enable system librarians to control metadata
processing
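As an illustration of the fuzzy deduplication item, the sketch below compares normalized titles with the standard library's `difflib.SequenceMatcher`. This is a swapped-in toy technique, not the method used in finc or d:swarm; the 0.9 threshold and the function names are assumptions.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase and keep only letters, digits and single spaces."""
    kept = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(kept.split())


def is_probable_duplicate(a, b, threshold=0.9):
    """Compare two titles by similarity ratio; the threshold is an assumption."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


print(is_probable_duplicate("Faust: Eine Tragödie", "Faust. Eine Tragoedie"))
print(is_probable_duplicate("Faust: Eine Tragödie", "Die Leiden des jungen Werthers"))
```

Pairwise comparison like this grows quadratically with the number of records, so a real workflow would need some form of blocking or indexing first.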
Tomorrow: common workflows
● All data flows through both tools (finc + d:swarm)
● Deduplication (duplicates are easier to recognize in a graph database)
● FRBRization (aggregate different physical and formal versions of a
work)
● Knowledge graph makes enrichment (authorities, altmetrics data,
usage data, …) and analytics easier
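FRBRization can be pictured as grouping records under a shared work key. The sketch below uses a deliberately simple key (normalized author plus title); the sample records and the `work_key` function are made up for illustration and do not reflect how d:swarm identifies works.

```python
from collections import defaultdict

# Made-up manifestations of two works, for illustration only.
records = [
    {"id": 1, "author": "Goethe, Johann Wolfgang von", "title": "Faust", "format": "print"},
    {"id": 2, "author": "Goethe, Johann Wolfgang von", "title": "faust ", "format": "ebook"},
    {"id": 3, "author": "Kafka, Franz", "title": "Der Process", "format": "print"},
]


def work_key(record):
    """Hypothetical work identifier: normalized author plus normalized title."""
    return (record["author"].strip().lower(), record["title"].strip().lower())


works = defaultdict(list)
for record in records:
    works[work_key(record)].append(record["id"])

for key, ids in works.items():
    print(key, ids)
```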
Scalability issues
● current implementation of property graph is too slow
● test results with 64GB RAM, SSD, 16 cores
○ 1.2 million records (flat format): 10 hours for complete workflow
(ingest, transformation, export)
○ more complex formats (MARC21) produce up to 5x as many statements
● single Neo4j instance, storage and memory issues
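To put the test result in proportion, the sketch below converts it to a throughput and extrapolates to the 92 million records of the aggregated index (12 million + 80 million). The extrapolation assumes runtime grows linearly with record count, which is an assumption and probably optimistic given the storage and memory issues noted above.

```python
RECORDS_TESTED = 1_200_000      # from the slide
HOURS_TAKEN = 10                # from the slide
INDEX_RECORDS = 92_000_000      # 12 million + 80 million, aggregated index

records_per_hour = RECORDS_TESTED / HOURS_TAKEN
records_per_second = records_per_hour / 3600

# Assumption: runtime grows linearly with record count.
hours_for_index = INDEX_RECORDS / records_per_hour
days_for_index = hours_for_index / 24

print(f"{records_per_hour:,.0f} records/hour")         # 120,000 records/hour
print(f"{records_per_second:.1f} records/second")      # 33.3 records/second
print(f"{days_for_index:.1f} days for the full index")  # 31.9 days for the full index
```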
d:swarm architecture
Possible solutions?
● “throw hardware at it”
● Another graphDB, parallelization?
○ ArangoDB: https://www.arangodb.com
○ Apache Giraph: http://giraph.apache.org
○ Blazegraph: http://blazegraph.com (Wikidata’s choice)
● Gradoop?!