OpenAIRE Text notes of the Tutorial on Automatic Inference Of Links
Once your repository platform has been made OpenAIRE compliant, researchers from your institution will
be able to deposit their publications by providing the corresponding files and bibliographic metadata, including
license information and the list of EC projects that funded the publications. Integrating your repository
with the OpenAIRE infrastructure is an important step towards helping your researchers comply with
the EC Open Access mandate. While this is a clear benefit for future deposits, what happens to
all the publications deposited in the past, whose metadata did not include EC project information?
You can approach the problem in two ways. The so-called manual approach consists of asking your
researchers to revise and complete all past depositions through the newly provided user interfaces. Since
this may be a tedious job, the OpenAIRE infrastructure also offers an automatic inference approach, in
which dedicated services infer, from the PDF files of the publications, the list of EC projects
that have likely funded them.
To this end, repository managers must make the PDF files of the publications available to the OpenAIRE
infrastructure. This can happen through standard protocols, such as FTP, to be agreed with the OpenAIRE
technical team. Most importantly, the name of each PDF file must include the OAI-PMH identifier provided
with the corresponding metadata record. This implicit link allows the EC project information extracted by
OpenAIRE to be matched to the right metadata record, so that the record can be completed.
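As an illustration of the naming requirement, here is a minimal Python sketch. The notes only say that the file name must include the OAI-PMH identifier; the exact convention is to be agreed with the OpenAIRE technical team. The sketch assumes that percent-encoding is acceptable, because OAI identifiers usually contain characters such as ":" or "/" that are awkward in file names. The function names and the example identifier are hypothetical.

```python
from urllib.parse import quote, unquote


def pdf_name_for(oai_identifier: str) -> str:
    """Build a PDF file name that embeds the OAI-PMH identifier.

    Assumption: percent-encoding every reserved character is an
    acceptable convention; confirm the real one with the OpenAIRE team.
    """
    return quote(oai_identifier, safe="") + ".pdf"


def oai_identifier_from(pdf_name: str) -> str:
    """Recover the OAI-PMH identifier from a name built by pdf_name_for."""
    stem = pdf_name[:-4] if pdf_name.lower().endswith(".pdf") else pdf_name
    return unquote(stem)


if __name__ == "__main__":
    # Hypothetical identifier, for illustration only.
    name = pdf_name_for("oai:repo.example.org:1234")
    print(name)
    print(oai_identifier_from(name))
```

Because the encoding is reversible, the identifier can be recovered from the file name alone, which is what makes the link between the PDF and its metadata record implicit.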
The inference process returns to repository managers the list of file names for which it was possible to infer
at least one EC project, each followed by the corresponding list of grant agreement numbers. The list can be
provided in several formats, including txt or Excel files, to be agreed with the OpenAIRE technical team. Repository
managers must write scripts capable of processing this list to complete the local database with the missing
associations between publications and EC projects. At this stage, repository managers may involve
researchers to confirm the result of the inference process and therefore enable a simplified and faster
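A script for processing the returned list could start from something like the following Python sketch. The actual format is to be agreed with the OpenAIRE technical team; this sketch assumes a plain txt file with one line per PDF, holding the file name followed by its grant agreement numbers, all separated by tabs. The function name, the "#" comment convention, and the sample values are hypothetical, and the update of the local database is left out because it depends on the repository platform.

```python
from typing import Dict, Iterable, List


def parse_inference_results(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each PDF file name to its inferred grant agreement numbers.

    Assumed (not official) layout: file name, then one or more grant
    agreement numbers, tab-separated, one publication per line.
    """
    links: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        file_name, *grants = line.split("\t")
        grants = [g.strip() for g in grants if g.strip()]
        if grants:  # keep only files with at least one inferred project
            links.setdefault(file_name.strip(), []).extend(grants)
    return links


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = ["a.pdf\t123456\t654321\n", "\n", "b.pdf\t246810\n"]
    for name, grants in parse_inference_results(sample).items():
        print(name, grants)
```

Keeping the parsing separate from the database update makes it easier to show the inferred associations to researchers for confirmation before anything is written.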
The automatic inference service is CPU-intensive, because it must parse large sets of PDF
files and identify references to EC project grant agreement numbers. For this reason, OpenAIRE exploits the
Grid computing power provided by the D4Science infrastructure, which is in turn powered by the gCube software system. For
further information, please visit the highlighted URLs.