Thomas Krichel (Long Island University) – AuthorClaim
IR repository data in AuthorClaim
  Wolfram Horstmann (1), Thomas Krichel (2, 3, 4)
  (1) Bielefeld University
  (2) Long Island University
  (3) Novosibirsk State University
  (4) Open Library Society
  Repository Fringe, 2011-08-03
thanks
  to the organizers
  to the BASE contributors: Bernd Fehling, Marek Imialek, Mathias Loesch, Renata Mitrenga, Dirk Pieper, Jochen Schirrwagen, Friedrich Summann, Sebastian Wolf
structure
  background
  3lib
  author claiming and AuthorClaim
  BASE
  AuthorClaim and BASE
background
  Wolfram is chief information officer (scholarly information) at Bielefeld University.
  They have run the BASE search engine since 2004.
  They are doing long-run work.
background
  Thomas is the founder of RePEc.
  He started it in the early 1990s.
  He is doing long-run work.
motivation
  Make (economics) papers freely available.
  Make information about the papers freely available.
  Have a self-sustaining infrastructure for this; don't rely on external sources.
RePEc
  RePEc is often misunderstood as a repository.
  In fact it is a collection of 1300+ institutional (subject) repositories.
    pre-date OAI
    reduced business model
    more tightly interoperable
RePEc sources of success
  There are a lot of sources of success.
  The reasons can be classified as
    business case
    technical matters
  Both are linked.
RePEc business case
  RePEc tries to decentralize as much as it can.
  RePEc runs essentially on volunteer power.
  RePEc encourages reuse of RePEc data.
RePEc technical case
  RePEc registers authors with the RePEc Author Service (RAS).
  RePEc registers institutions.
  RePEc provides evaluative data for authors and institutions.
RePEc and IRs
  RePEc is not a repository.
  RePEc is a bibliographic layer over repositories.
  IRs can/will benefit from a similar bibliographic layer.
requirements for such a layer
  Not dependent on external funding.
  Freely reusable instantaneously.
  Must be there for the long run.
a RePEc for all disciplines
  RePEc bibliographic data → 3lib
  RePEc Author Service → AuthorClaim
  EDIRC → ARIW
3lib
  3lib is an initial attempt at building an aggregate of freely available bibliographic data.
  It is a project by OLS, sponsored by OKFN.
  About 35 million records from the usual suspects: PubMed, OpenLibrary, DBLP, RePEc.
3lib elements
  The data elements in 3lib are very simple:
    title
    author name expressions
    link to item page on provider site
    identifier
  3lib is meant to serve AuthorClaim.
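The four elements listed on this slide could be modelled as a small record type. This is only a minimal sketch: the class name `ThreeLibRecord`, the field names, and the sample values are illustrative assumptions, not the actual 3lib schema.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class ThreeLibRecord:
    """Hypothetical container for the four 3lib data elements."""
    identifier: str          # identifier of the record
    title: str               # title of the item
    author_names: List[str]  # author name expressions, kept as given by the provider
    link: str                # link to the item page on the provider site


# Made-up sample values, for illustration only.
record = ThreeLibRecord(
    identifier="example:0001",
    title="An example paper",
    author_names=["J. Smith", "J. Smith"],
    link="http://repository.example.org/item/1",
)
print(asdict(record))
```

Keeping author names as plain strings reflects the slide's wording: they are name expressions, not identified persons.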
AuthorClaim
  AuthorClaim is an authorship claiming service for 3lib data.
  It lives at http://authorclaim.org.
  It uses the same software as the RePEc Author Service, called ACIS.
  It has been running since early 2008.
author claiming history I
  Thomas started the first author claiming system, the RePEc Author Service, in 1999.
  The system was written by Markus J.R. Klink.
author claiming history II
  ISI created ResearcherID in 2006 (?).
  arXiv has had an author claiming system since 2009.
  NIH and Google Scholar are working on it.
  The ORCID initiative has been looking into author identification since 2009.
claiming vs identification
  Author claiming records are NOT author identification records.
  The difference is called "Klink's problem".
  A person can claim to be an author of a paper. If there are several authors, we don't know which author (s)he is.
Klink's problem example
  Jane and John Smith write a paper.
  The author list says "J. Smith and J. Smith".
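The Smith example can be sketched in a few lines of code. The function name `matching_positions` and the set of name variants are hypothetical; the point is only that a claim matches more than one position in the author list, so it cannot say which author the claimant is.

```python
from typing import List, Set


def matching_positions(author_list: List[str], name_variants: Set[str]) -> List[int]:
    """Return the positions in the author list that match any of the claimant's name variants."""
    return [i for i, name in enumerate(author_list) if name in name_variants]


author_list = ["J. Smith", "J. Smith"]      # as printed on the paper
jane_variants = {"Jane Smith", "J. Smith"}  # hypothetical name variants in Jane's profile

positions = matching_positions(author_list, jane_variants)
# The claim says "Jane is an author of this paper"; it does not say which position is hers.
ambiguous = len(positions) > 1
print(positions, ambiguous)  # [0, 1] True
```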
AuthorClaim data
  ftp://ftp.authorclaim.org
  CC0
  more than 100 profiles, growing slowly
more on the example
  The refused papers are there for services to build learning models for author names.
  Learning is an integral part of the way AuthorClaim works.
  Records also contain the 3lib data for papers, and they have ARIW-based affiliation data.
IRs and author identification
  IRs are generally too large for author identification by IR staff.
  Only registration of contributors is usually required.
IRs and author claiming
  IRs are too small to make it meaningful for authors to claim papers in them directly.
benefits of author claiming to IRs
  All papers by an author can be put together.
  The task can be completely automated once an AuthorClaim record claims a paper in the IR.
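How an IR might put an author's papers together from claims can be sketched as follows. The function `papers_by_author`, the shape of the claims mapping, and all identifiers are assumptions made for illustration; the actual AuthorClaim record format is not shown here.

```python
from typing import Dict, List, Set


def papers_by_author(claims: Dict[str, Set[str]], ir_identifiers: Set[str]) -> Dict[str, List[str]]:
    """For each author, list the claimed papers that are held in this IR."""
    return {
        author_id: sorted(claimed & ir_identifiers)
        for author_id, claimed in claims.items()
        if claimed & ir_identifiers
    }


# Hypothetical author ids and record identifiers.
claims = {
    "author-1": {"ir:1", "ir:3", "other:9"},
    "author-2": {"ir:2"},
    "author-3": {"other:7"},
}
ir_identifiers = {"ir:1", "ir:2", "ir:3", "ir:4"}

print(papers_by_author(claims, ir_identifiers))
```

Once the claims exist, no manual step is needed: the grouping follows from a set intersection per author.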
to get it done
  Of the four elements in a 3lib record, only the link to a web page describing the item is problematic.
  But it is cumbersome to customize for close to 2k IRs.
partnership with BASE
  We need a centralized collection.
  BASE is already doing this job.
  BASE can deliver all data to AuthorClaim regularly.
BASE aggregation services
  constant monitoring of the OAI-PMH world
  configuration of harvesting for each new and erroneous repository
  metadata stores (raw and normalized)
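For readers unfamiliar with OAI-PMH harvesting, a request for Dublin Core records is an HTTP GET with a `verb` argument. The sketch below only builds such request URLs; the endpoint is a hypothetical example, and nothing here describes how BASE itself configures its harvesters.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     from_date: Optional[str] = None,
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for Dublin Core metadata."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: only the verb accompanies it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if from_date is not None:
            params["from"] = from_date  # selective harvesting by datestamp
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, for illustration only.
ENDPOINT = "http://repository.example.org/oai"
print(list_records_url(ENDPOINT))
print(list_records_url(ENDPOINT, from_date="2011-08-01"))
```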
BASE normalization
  highly heterogeneous use of OAI-DC requires cleaning and enrichment, e.g.
    dc:type
    dc:date
    dc:language
  also enrichment with (missing) subject classifications
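The kind of cleaning that heterogeneous dc:type, dc:date and dc:language values call for might look like the sketch below. The lookup tables, function names, and target vocabularies are invented for illustration and are not BASE's actual normalization rules.

```python
import re
from typing import Dict, Optional

# Hypothetical lookup tables; a real service would need far larger ones.
LANGUAGE_MAP = {"en": "eng", "eng": "eng", "english": "eng",
                "de": "deu", "ger": "deu", "deu": "deu", "german": "deu"}
TYPE_MAP = {"article": "article", "journal article": "article",
            "text.article": "article", "thesis": "thesis",
            "doctoral thesis": "thesis", "phd thesis": "thesis"}


def normalize_year(raw: str) -> Optional[str]:
    """Extract a four-digit year from a free-form dc:date value, if there is one."""
    match = re.search(r"\b(1[5-9]\d\d|20\d\d)\b", raw)
    return match.group(1) if match else None


def normalize_record(dc: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map raw dc:type, dc:date and dc:language values to cleaned forms (None if unknown)."""
    return {
        "type": TYPE_MAP.get(dc.get("type", "").strip().lower()),
        "year": normalize_year(dc.get("date", "")),
        "language": LANGUAGE_MAP.get(dc.get("language", "").strip().lower()),
    }


raw = {"type": "Journal Article", "date": "2011-08-03T00:00:00Z", "language": "English"}
print(normalize_record(raw))  # {'type': 'article', 'year': '2011', 'language': 'eng'}
```

Unknown values map to `None` rather than being guessed, so they can be flagged for later enrichment.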
BASE search services
  builds index (Solr)
  end user interfaces (VuFind)
  iPhone app
  API for index usage by third parties (HTTP or SOAP)
BASE data services
  repository profile service (REST)
  raw metadata store access (HTTP or SOAP)
  rsync for AuthorClaim
BASE data in AuthorClaim
  selection of records that have the required data:
    author
    title
    link
    identifier
  incremental updates
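Selecting the records that carry all four required elements can be sketched as a simple filter. The dictionary keys and sample records are assumptions; the real exchange format between BASE and AuthorClaim is not described on the slide.

```python
from typing import Dict, Iterable, List

REQUIRED_FIELDS = ("author", "title", "link", "identifier")


def has_required_data(record: Dict[str, str]) -> bool:
    """True when every required field is present and non-empty."""
    return all(record.get(name, "").strip() for name in REQUIRED_FIELDS)


def select_records(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the records that have all required data."""
    return [r for r in records if has_required_data(r)]


# Made-up sample records.
records = [
    {"author": "J. Smith", "title": "Paper one",
     "link": "http://repository.example.org/1", "identifier": "ex:1"},
    {"author": "", "title": "Paper two",
     "link": "http://repository.example.org/2", "identifier": "ex:2"},
    {"author": "A. Jones", "title": "Paper three", "identifier": "ex:3"},
]
selected = select_records(records)
print([r["identifier"] for r in selected])  # ['ex:1']
```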
repository exclusion
  From the BASE profile, AuthorClaim discards some IRs that contain
    student work
    digitized old material
    link collections
    primary research data
  There are some minor manual exclusions.
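The exclusion step could be expressed as a filter over repository profiles. The profile structure, the content labels, and the repository ids below are hypothetical; the BASE profile service may describe repositories quite differently.

```python
from typing import Dict, List, Set

# Content categories taken from the slide; the exact label strings are assumed.
EXCLUDED_CONTENT: Set[str] = {
    "student work",
    "digitized old material",
    "link collection",
    "primary research data",
}
MANUAL_EXCLUSIONS: Set[str] = {"repo-d"}  # hypothetical manually excluded repository


def keep_repository(profile: Dict) -> bool:
    """Keep a repository unless it is manually excluded or holds excluded content."""
    if profile["id"] in MANUAL_EXCLUSIONS:
        return False
    return not (set(profile.get("content", [])) & EXCLUDED_CONTENT)


# Made-up repository profiles.
profiles: List[Dict] = [
    {"id": "repo-a", "content": ["articles", "working papers"]},
    {"id": "repo-b", "content": ["student work"]},
    {"id": "repo-c", "content": ["articles", "primary research data"]},
    {"id": "repo-d", "content": ["articles"]},
]
kept = [p["id"] for p in profiles if keep_repository(p)]
print(kept)  # ['repo-a']
```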
results so far
  1930 repositories, 12740116 records.
  534 records claimed.
  The documentation at http://wotan.liu.edu/base/ needs some debugging.
  The collection is not yet announced because it is being read.
the end
  Contact firstname.lastname@example.org or email@example.com for more information.