BASE (Bielefeld Academic Search Engine) harvests about 2,900 repositories via OAI-PMH and indexes more than 37 million documents. Harvesting runs into problems such as repositories that do not respond and metadata fields whose values vary widely. BASE automatically classifies documents by analyzing descriptions, fulltext, and subject headings, which supports subject-based browsing of the harvested open access documents.
BASE: a powerful search engine for Open Access documents
1. Universitätsbibliothek
BASE – a powerful search engine for Open Access documents
AIMS@OA Week
25 Oct 2012
Friedrich Summann
Bielefeld University Library
2. Universitätsbibliothek
Overview
BASE – the OA search engine
Harvesting OAI-PMH and its challenges
Metadata Aggregation and Data Quality
Processing Subject Repositories
3. Universitätsbibliothek
Harvesting Background
BASE (Bielefeld Academic Search Engine)
• started in 2002, active since 2004
• 2900 repositories harvested via OAI-PMH
• 2337 repositories indexed
• 37.4 million documents included
• 3.1 million documents automatically classified
• Lucene/Solr Index
• VuFind end-user GUI
7. Universitätsbibliothek
BASE Interfaces
• Query REST interface
• Repository Metadata interface
• Data Delivery Interface (Repository based, DDC of aggregated Metadata) (under construction)
8. Universitätsbibliothek
Overview
BASE – the OA search engine
Harvesting OAI-PMH and its challenges
Metadata Aggregation and Data Quality
Processing Subject Repositories
10. Universitätsbibliothek
Harvesting : Challenges and pitfalls
Repository does not respond (temporarily, or only for specific verbs)
Results are not valid XML
Harvesting breaks off (especially with big repositories)
Incremental harvesting does not work
No information on deleted or added records
Variety of field contents
Change of behavior (base URL, contents)
Metadata points to a reference or citation only
Link to the document does not work
Fulltext access is restricted (not OA)
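Several of these pitfalls (no response, malformed XML, broken long harvests, missing deletion information) can be handled defensively in the harvesting loop. The following is a minimal sketch, not BASE's actual harvester: the function names `http_fetch` and `harvest_records` and the retry parameters are assumptions made for illustration. It only checks that responses are well-formed XML, not that they are valid against the OAI-PMH schema.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_fetch(url, timeout=60):
    """Default fetcher: return the raw response body for a URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def harvest_records(base_url, fetch=http_fetch, from_date=None,
                    max_retries=3, retry_wait=5.0):
    """Yield (record, is_deleted) pairs from ListRecords, page by page.

    Hypothetical helper: each request is retried when the repository does
    not respond or returns XML that is not well-formed; after max_retries
    failed attempts the last exception is re-raised.
    """
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if from_date:
        params["from"] = from_date  # incremental harvest; not every repository honors it
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        for attempt in range(1, max_retries + 1):
            try:
                root = ET.fromstring(fetch(url))  # ParseError if not well-formed
                break
            except (OSError, ET.ParseError):
                if attempt == max_retries:
                    raise
                time.sleep(retry_wait)
        for record in root.iter(OAI_NS + "record"):
            header = record.find(OAI_NS + "header")
            # Repositories that track deletions mark them in the record header.
            is_deleted = header is not None and header.get("status") == "deleted"
            yield record, is_deleted
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no further pages
        # A resumption token must be sent without the other request arguments.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing the fetcher in as a parameter keeps the loop testable without network access. A repository that never reports deleted records cannot be detected this way and still needs a periodic full harvest.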
11. Universitätsbibliothek
Overview
BASE – the OA search engine
Harvesting OAI-PMH and its challenges
Metadata Aggregation and Data Quality
Processing Subject Repositories
12. Universitätsbibliothek
dc:language: Variety of Metadata Values
Analysis: European Repositories, Oct. 2009
804 different values in 4720585 tags
Top values
en – 1385175
eng – 511085
spa – 345658
de – 319937
en_GB – 178381
ger – 166587
eng; – 102678
FR – 95798
…
Examples of rare values
; – 3
? – 3
at;deu – 2
enm;eng – 2
FRA – 2
fr_BE – 2
Andere Sprache – 2
cat, spa, fra, eng. – 2
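An aggregator has to fold variants such as "en", "eng", "en_GB" and "eng;" into one code before language facets become usable. The sketch below shows one possible normalization to ISO 639-3 codes; the mapping table and the function name `normalize_languages` are illustrative assumptions, and the slides do not describe how BASE itself does this.

```python
import re

# Small illustrative mapping to ISO 639-3 codes; a real table would be far larger.
LANGUAGE_CODES = {
    "en": "eng", "eng": "eng",
    "de": "deu", "ger": "deu", "deu": "deu",
    "fr": "fra", "fre": "fra", "fra": "fra",
    "es": "spa", "spa": "spa",
    "ca": "cat", "cat": "cat",
    "enm": "enm",
}


def normalize_languages(raw_value):
    """Return the sorted distinct known codes found in a raw dc:language value."""
    codes = set()
    # Split multi-valued strings such as "cat, spa, fra, eng." or "enm;eng".
    for part in re.split(r"[;,\s]+", raw_value.strip()):
        token = part.strip(".").lower()
        # Drop a region suffix: "en_GB" and "fr-BE" become "en" and "fr".
        token = re.split(r"[_-]", token)[0]
        if token in LANGUAGE_CODES:
            codes.add(LANGUAGE_CODES[token])
    return sorted(codes)


print(normalize_languages("en_GB"))                 # ['eng']
print(normalize_languages("cat, spa, fra, eng."))   # ['cat', 'eng', 'fra', 'spa']
print(normalize_languages("Andere Sprache"))        # []
```

Values that match nothing in the table come back as an empty list, so they can be logged and reviewed instead of being guessed at.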
13. Universitätsbibliothek
dc:type: Variety of Metadata Values
Analysis: German Repositories, Sept. 2009
2772 different values in 1394089 tags
Top values
Dataset – 588525
Artikel – 192306
Rezension – 113924
Text – 73210
Text.Thesis.Doctoral – 30201
Article – 29278
Miszelle – 27060
NonPeerReviewed – 24688
ResearchPaper – 16046
Dissertation – 15531
…
Examples of rare values
Software – 7
Kulturkarten – 7
Composition – 7
Interactive Resource – 4
Interview – 3
Media – 1
content analysis – 1
Anniversary Publication – 1
qualitative research – 1
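A value analysis of this kind can be reproduced by tallying the contents of one Dublin Core element across harvested records. The following is a rough sketch under the assumption that records are available as XML strings with the standard `dc` namespace; the function name `count_field_values` is hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"


def count_field_values(xml_documents, field="type"):
    """Count how often each distinct value of a dc:<field> element occurs."""
    counts = Counter()
    for document in xml_documents:
        root = ET.fromstring(document)
        for element in root.iter(DC_NS + field):
            value = (element.text or "").strip()
            if value:  # skip empty tags
                counts[value] += 1
    return counts
```

With the result in `counts`, `len(counts)` gives the number of different values, `sum(counts.values())` the number of non-empty tags, and `counts.most_common(10)` the top of the list.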
14. Universitätsbibliothek
Overview
BASE – the OA search engine
Harvesting OAI-PMH and its challenges
Metadata Aggregation and Data Quality
Processing Subject Repositories
18. Universitätsbibliothek
Contents for Classifier Feed
dc:description: 30 to 40% of metadata records have a dc:description with relevant abstract information
Document fulltext (if accessible)
setSpec contains DDC and LCC codes
dc:subject contains a lot of subject-oriented information
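How these four sources might be combined for a classifier can be sketched as follows. This is an illustration only: the slides do not show BASE's classifier input format, the `ddc:` / `lcc:` prefix convention for setSpec values is an assumption that does not hold for every repository, and both function names are hypothetical.

```python
import re

# Assumed setSpec convention, e.g. "ddc:004" or "lcc:QA76"; real repositories vary.
CLASS_CODE = re.compile(r"^(ddc|lcc):([A-Za-z0-9.]+)$", re.IGNORECASE)


def extract_class_codes(set_specs):
    """Return (scheme, code) pairs found in a record's setSpec values."""
    codes = []
    for spec in set_specs:
        match = CLASS_CODE.match(spec.strip())
        if match:
            codes.append((match.group(1).lower(), match.group(2)))
    return codes


def build_classifier_text(record, fulltext=None):
    """Join descriptions, subjects and optional fulltext into one input text.

    `record` is assumed to be a dict mapping field names to lists of strings.
    """
    parts = []
    parts.extend(record.get("description", []))
    parts.extend(record.get("subject", []))
    if fulltext:
        parts.append(fulltext)
    return "\n".join(p.strip() for p in parts if p and p.strip())
```

Records whose setSpec already carries a class code could serve as labeled examples, while the joined text is what a classifier would see for the unlabeled rest.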