In 2012, Patrick OBrien and I published a paper titled “Invisible Institutional Repositories…”, in which we showed that many IRs are not being harvested well by search engines, particularly Google Scholar. We also figured out what was wrong and demonstrated a solution.
In short, IRs are not offering search engines appropriately structured metadata. Search engines like Google Scholar want a structured citation so that they can ingest it and then deliver it to their users in any style the users want. Many IRs were built using Dublin Core metadata, which simply doesn’t have fields for every part of a citation, and the tendency has been to lump entire citations into a single Dublin Core field, usually the Source field. This creates a block of data that is not machine comprehensible.
In fact, Google Scholar published a guideline a few years ago that said to use Dublin Core “only as a last resort.” It recommends using one of four metadata schemas: Highwire Press, PRISM, ePrints, or BePress.
If search engines are having these problems, then altmetrics services like ImpactStory will likely have them too.
We conducted an experiment and demonstrated that converting citations to Highwire Press and making them available in HTML meta tags works. The problem is that many IRs are large, and converting all those citations manually would be difficult. We think automated parsing methods are the solution, and we are actively working on a process that will be replicable elsewhere.
While we develop this citation parsing process, we are also helping Montana State migrate to Activity Insight from a company called Digital Measures. It’s a new faculty activity database (FAD) that replaces the old home-grown one. MSU faces the problem of migrating citations from the old FAD, but also of figuring out how to parse citations directly from CVs and resumes, because the old FAD didn’t have a very high participation rate. Our job is to deal with the citations on the CVs, and as you can imagine they come in a variety of styles and formats.
So, basically, we’re trying to take citations from IR metadata and CVs, parse them accurately, and import them into Activity Insight. Then we have a hook built from Activity Insight into our IR. Once the citations are in the IR in a proper, structured format, they can be harvested by search engines and altmetrics services. Why this direction? There is a mandate for faculty to use Activity Insight. There is no mandate for them to use the IR. And of course, we want to make this as painless as possible for them.
In summary, the points we want to make are these:
Citations are poorly structured in many IRs.
We think this damages IR use because scholars aren’t finding the articles in search engines.
It is very difficult and time-consuming to convert metadata manually, so we are developing an automated parsing process using open source software.
We are working effectively with the campus Office of Planning and Analysis to help make Activity Insight a success, which in turn will help our IR.
Appropriately structured data will help IR material get harvested and cited, and probably also improve altmetrics.
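To make “appropriately structured” a little more concrete, here is a minimal sketch of how one already-parsed citation might be turned into Highwire Press style HTML meta tags. It is an illustration only, not the process we are building: the function name `highwire_meta_tags` and the dictionary keys are hypothetical choices made for this example, and the `citation_*` tag names are the ones we understand Google Scholar’s inclusion guidelines to describe. The sample data is the 2012 paper mentioned above.

```python
from html import escape


def highwire_meta_tags(citation):
    """Return HTML <meta> tags (Highwire Press names) for one citation.

    `citation` is a plain dict; its keys are hypothetical names chosen
    for this sketch, not fields from any particular IR platform.
    """
    tags = [("citation_title", citation["title"])]

    # One citation_author tag per author, in the original order.
    tags += [("citation_author", name) for name in citation["authors"]]

    # Optional parts of the citation: emitted only when present.
    optional = [
        ("citation_publication_date", "year"),
        ("citation_journal_title", "journal"),
        ("citation_volume", "volume"),
        ("citation_issue", "issue"),
        ("citation_firstpage", "first_page"),
        ("citation_lastpage", "last_page"),
    ]
    for tag_name, key in optional:
        if citation.get(key):
            tags.append((tag_name, str(citation[key])))

    # Escape values so quotes and ampersands cannot break the HTML.
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags
    )


example = {
    "title": "Invisible Institutional Repositories: Addressing the Low "
             "Indexing Ratios of IRs in Google Scholar",
    "authors": ["Arlitsch, Kenning", "OBrien, Patrick S."],
    "year": 2012,
    "journal": "Library Hi Tech",
    "volume": 30,
    "issue": 1,
    "first_page": 60,
    "last_page": 81,
}

print(highwire_meta_tags(example))
```

Each part of the citation ends up in its own tag, which is exactly what a single lumped Dublin Core Source field cannot offer. The hard part, of course, is getting from a free-text citation to a clean dictionary like `example` in the first place, and that is the parsing problem described above.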
Are the metadata ready?
Arlitsch, Kenning and Patrick S. OBrien. (2012). Invisible Institutional
Repositories: Addressing the Low Indexing Ratios of IRs in Google Scholar.
Library Hi Tech, 30(1), 60-81
Many IRs don’t offer metadata that
search engines can identify, parse and digest
Wolfinger, N. H., & McKeever, M. (2006, July). Thanks for nothing: changes in income and
labor force participation for never-married mothers since 1982. In 101st American
Sociological Association (ASA) Annual Meeting; 2006 Aug 11-14; Montreal, Canada (No.
2006-07-04, pp. 1-42). Institute of Public & International Affairs (IPIA), University of Utah.