An Impact-Based Filtering Approach for Literature Searches
Alvin Vista
Journal of Librarianship and Information Science
Published online 5 June 2012 (OnlineFirst Version of Record)
DOI: 10.1177/0961000612448207
The online version of this article can be found at:
http://lis.sagepub.com/content/early/2012/06/04/0961000612448207
Published by SAGE Publications: http://www.sagepublications.com
Downloaded from lis.sagepub.com at The University of Melbourne Libraries on October 11, 2012
aim to bound the scope of the review based on some sam-
pling criteria (e.g. types of sources, language, types of
results reported). In addition, topic-level limiters are useful
in narrowing the search results, and these include the
above-mentioned coverage approaches as well as using
Boolean operators (i.e. OR, AND, etc.) to further specify
the search parameters. Thus, the scope restriction suggested
in this paper comes after any topic-level limiters and
coverage approaches have been implemented, and is primarily
concerned with further reducing the number of sources
returned by searches on topics that yield particularly
numerous hits. This is especially noticeable for
searches about methods (e.g. statistical topics, experimental
protocols, scientific techniques), where results can run into
the thousands even after the scope of the search has
previously been limited. Current literature searches usually
involve some form of date-range filter to restrict the scope
of the search to more manageable numbers.
In order to restrict the scope based on impact, a form of
citation ranking or benchmark method needs to be used, in
turn implying the need for citation databases. For years the
Thomson citation index, or the ISI Web of Science (WoS),
has been the de facto leader in providing data for citation
analyses (Meho and Yang, 2007). A comprehensive examination
of WoS and Scopus conducted by Meho and Yang
(2007) found that, at the time of their study, Scopus and
WoS contained over 36 and 28 million records, respectively,
going as far back as 1900 for WoS and 1966 for Scopus,
in nearly every major field of study. These databases continue
to expand rapidly and will continue to do so in the future as
open access becomes increasingly popular.1
The emergence of Google Scholar enabled a significant
shift in citation analysis by providing free access to citation
research while using Google’s massive web-crawling capa-
bilities with the same automated software (crawlers and
parsers) that power Google search (Google Scholar, 2011).
There have been a number of studies comparing the results
from Google Scholar with WoS (Harzing and Wal, 2008;
Noruzi, 2005; Pauly and Stergiou, 2005; Schroeder, 2007)
as well as multi-index comparisons that include other
databases such as Scopus and PubMed (see Falagas et al., 2008;
Levine-Clark and Gil, 2009; Meho and Yang, 2007). These
studies show that while each database, Google Scholar
included, has particular strengths and weaknesses, they
tend to overlap in terms of coverage and complement each
other in terms of features (Meho and Yang, 2007; Noruzi,
2005). The main criticisms of Google Scholar tend to
revolve around inconsistent coverage and results (Jacso,
2005), and comparatively less transparent database charac-
teristics because it does not publish detailed data and statis-
tics on its sources (Jacso, 2005; Meho and Yang, 2007).
Nevertheless, one main advantage of Google Scholar is that
it is the only major database with general coverage that is
freely accessible,2 thus providing a source of data that can
be accessed by third party platforms (Harzing and Wal,
2008; Schroeder, 2007).
By using citation counts, one of the earliest and simplest
impact metric (e.g. Cawkell, 1968; E. Garfield, 1970;
Gross and Gross, 1927), this paper proposes a method of
filtering that is both fast and efficient while also being
more relevant than other limiters that restrict the scope
based solely on date. Objective assessment of scholarly
impact is a rich area of research that continues to evolve.
Quantifying the impact and significance of scientific
output typically involves methods that are based on some form
of citation analysis (Pudovkin and Garfield, 2009; Radicchi
et al., 2008). To implement the impact-based filter, a freely
available tool called Publish or Perish (Harzing, 2011) is
used.3 The Publish or Perish software automates citation
counting, among other metrics, using data from Google
Scholar, and presents the results in tabular form that can be
exported to text or spreadsheet formats (Harzing and Wal,
2008).
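As an illustrative sketch (not the paper's own code), filtering such an exported table by total citation count takes only a few lines. The column names (Cites, Title) and the sample rows are assumptions for illustration; an actual Publish or Perish export may label its fields differently.

```python
import csv
import io

# Hypothetical excerpt of a citation-search export; a real export would
# contain many more fields (authors, year, cites per year, etc.).
SAMPLE_EXPORT = """Cites,Title
120,Source A
12,Source B
55,Source C
"""

def filter_by_citations(csv_text, threshold):
    """Keep only the sources whose total citation count meets the threshold."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row["Title"] for row in rows if int(row["Cites"]) >= threshold]

# With a threshold of 50, only the well-cited sources remain.
print(filter_by_citations(SAMPLE_EXPORT, 50))
```

Because the filter operates on the exported table rather than on the search service itself, the same sketch applies to any citation database that can export tabular results.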
This paper aims to introduce a simple method of combining
effective filtering with automated bibliographic
search. Although Publish or Perish is the only citation
search tool presented here, it is not the purpose of this
paper to advocate any specific software for these purposes.
Publish or Perish is not the only citation analysis software
available, but it is one of the most user friendly. Invariably,
though, since it uses data only from Google Scholar, it
shares whatever weaknesses Google Scholar has concerning
the quality of its citation results (see Jacso, 2005, particularly
on the significant shortcomings of Google Scholar).
In this regard, one might consider alternative software that
uses other indices. One notable tool that uses search
results from WoS is HistCite, which was developed by
Eugene Garfield from his work on algorithmic historiography
(2001) and is made freely available through Thomson
Reuters.4
The choice of Publish or Perish for this paper is mainly
due to its ease of use and its free availability.
This paper does not discuss the pros and cons of
this software (or any other similar software) nor does it
advocate its specific use. While this paper presents results
of the impact-based search approach as implemented by
Publish or Perish, the concept is not limited to any particu-
lar software and can be easily implemented in HistCite
(Garfield, 2004; Garfield and Pudovkin, 2004; Garfield
et al., 2003). One approach using HistCite for an impact-based
search would be to use filters based on global citation
scores (GCS), which are based on ISI WoS citation
counts5 (Garfield, 2010).
Brief comparison of results
To present a brief comparison of this impact-based search
protocol with the more traditional date-range filters for
search, a simulation of a search on a particular topic was
conducted. The aim of this search was to gather all
available literature in which the phrases logistic regression and
differential item functioning are both present (equivalent to
using a Boolean AND). The simulation for both search
approaches was conducted in April 2011. The differences
between the protocols for the two approaches are outlined
below.
The date-range delimited search was conducted using
the SuperSearch platform from the University of Melbourne.
This platform is able to search simultaneously across online
library databases. In this particular search, six databases
were accessed (Table 1 presents the databases included in
this search and a brief description of each), with initial hits
totalling 1143. For the date range, we used three ranges – 5
years, 10 years and 15 years (2006–2011, 2001–2011 and
1996–2011, respectively). For the longest range (15 years,
year of publication ≥ 1996), the reduction in result hits was
not substantial, with 509 sources falling within the range.
Narrowing the range to only 10 years reduced the results
further to 476, but it was only after restricting the
results to publications from 2006 onwards (the 5-year
range) that the number of results became more
manageable, at 399.
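The date-range filters described above amount to a simple predicate on publication year. A minimal sketch, with an illustrative record structure of (title, year) pairs:

```python
# Minimal sketch of a date-range filter over search results, where each
# result is an illustrative (title, publication_year) pair.
results = [
    ("Source A", 1997),
    ("Source B", 2003),
    ("Source C", 2008),
    ("Source D", 2010),
]

def filter_by_date_range(results, start_year, end_year):
    """Keep results published within the inclusive year range."""
    return [(t, y) for t, y in results if start_year <= y <= end_year]

# A 5-year window (2006-2011) keeps only the newest sources.
print(filter_by_date_range(results, 2006, 2011))
```

As the comparison above shows, such a filter discards sources purely on age, regardless of how well cited they are.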
For the impact-based filter, we used a citation count
threshold6 of 50 and a threshold of five citations per year.7
Having two thresholds, one using a simple
total citation count and the other taking into account the
average citation count per year, balances the age effect
(due to accrual over time) of citations (Craig et al., 2007).
In addition, by limiting the search to a few related
fields of study, we avoid the complexity of having to interpret
results across fields with widely different bibliometric
conventions (Zitt et al., 2005). Publish or Perish has the
capacity to limit the search based on broad fields of study
(see Figure 1 for Publish or Perish interface). In this
search, the results were limited to those sources in social
sciences, arts and humanities.
The impact-based search returned 640 results with at
least one citation. Using the threshold of 50 citations, the
results were reduced to a more manageable 69 sources.
Using the alternative cites-per-year metric, with a threshold
of five cites per year, 104 sources were included.
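Each threshold can be expressed as a simple filter over the result set, applied separately as in the comparison above. The sketch below uses illustrative field names and an assumed search year of 2011; following note 7, cites per year is taken as total cites divided by years since publication.

```python
# Sketch of the two impact filters, applied separately as in the text.
# Field names, sample values, and the search year are assumptions.
SEARCH_YEAR = 2011
sources = [
    {"title": "Source A", "cites": 80, "year": 1990},
    {"title": "Source B", "cites": 20, "year": 2009},
    {"title": "Source C", "cites": 3, "year": 2004},
]

def by_total_cites(sources, threshold=50):
    """Filter on total citation count."""
    return [s["title"] for s in sources if s["cites"] >= threshold]

def by_cites_per_year(sources, threshold=5):
    """Filter on average cites per year (total cites / years since publication)."""
    return [s["title"] for s in sources
            if s["cites"] / max(SEARCH_YEAR - s["year"], 1) >= threshold]

# Source A (older, well cited overall) passes only the total-count filter;
# Source B (recent, rapidly cited) passes only the per-year filter.
print(by_total_cites(sources))
print(by_cites_per_year(sources))
```

This illustrates how the second threshold balances the age effect: a recent source with few accrued citations but a high citation rate is retained, while an older source is judged on its accumulated count.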
Figures 2 and 3 present the results from each search
approach visually. Figure 2 shows the actual numbers of
the raw results and the reduced numbers after the filter
thresholds were applied. Figure 3 shows the comparative
reductions in percentage of the raw results that were
excluded after the filters were applied (a larger percentage
would indicate a more efficient filter).
Figure 4 shows the distribution of publication years for
the initial results and Figure 5 shows the distribution
Table 1. Summary and brief descriptions of the databases used for the date-based filter.

Education Research Complete (EBSCO): Education Research Complete is a bibliographic and full-text database covering scholarly research and information relating to all areas of education. Topics covered include all levels of education from early childhood to higher education, and all educational specialties, such as multilingual education, health education and testing. The database also covers areas of curriculum instruction as well as administration, policy, funding, and related social issues.

ERIC (CSA): The ERIC (Educational Resources Information Center) database is sponsored by the US Department of Education to provide extensive access to education-related literature. The ERIC database corresponds to two printed journals: Resources in Education (RIE) and Current Index to Journals in Education (CIJE). Both journals provide access to some 14,000 documents and over 20,000 journal articles per year.

Expanded Academic ASAP (Gale): The Expanded Academic ASAP database meets research needs across all academic disciplines – from arts and the humanities to social sciences, science and technology. Available sources include scholarly journals, news magazines, and newspapers.

PsycINFO (CSA): PsycINFO provides access to international literature in psychology and related disciplines. The database includes literature from an array of disciplines related to psychology such as psychiatry, education, business, medicine, nursing, pharmacology, law, linguistics and social work.

SCOPUS (Elsevier): Scopus is a comprehensive scientific, medical, technical and social science database containing all relevant literature.

Web of Science (ISI): Web of Science provides seamless access to the Science Citation Index Expanded®, Social Sciences Citation Index®, and Arts & Humanities Citation Index™.

Source: SuperSearch, University Library, The University of Melbourne.
when the impact-based limiters were applied. The impact-based
approach yields sources that are more evenly distributed
across publication years, even though the initial results
were skewed towards newer sources, similar to the results based on
date filters (Figure 6). Comparing Figure 5 with Figure 6
shows that the proportional number of search results is more
uniform in the impact-based filter than the simple date filter.
This comparison does not necessarily imply that a more uni-
form distribution is better, but it shows that in simple time
Figure 1. Graphical interface of Publish or Perish (v 3.1.4097) showing the general citations search options.
Note: The main filter options include keyword combinations, time frame, and fields of research.
Figure 2. Comparison of raw and filtered result numbers by
search filter used.
Figure 3. Comparison of percentage reductions in result
numbers by search filter used.
filters, older publications may be disproportionately
outnumbered in terms of search results in topics that have
undergone a recent surge in popularity.
We have to keep in mind that the comparative efficiency
in this context still does not take into account the actual qual-
ity or relevance of the results. While a large reduction is
more advantageous from a purely logistic consideration,
our final goal still remains the quality of the literature that
we aim to include in a literature review. In this regard,
there is an inherent advantage with the impact-based filtering
approach. If we set out to compare the date-based filter with
the impact-based filter as an initial search result management
protocol, the impact-based filter provides not only
a more refined filter but also a set of results that are, by
definition, well cited by the research community, an
implicit indicator of quality (Cawkell, 1968; Garfield, 1970).
Discussion
It is conceivable that using an impact-based filter can
magnify the cumulative advantage of more established authors
at the expense of new ones (also known as the ‘Matthew
effect’; Merton, 1988), the dynamics of which are elaborated
in detail in Price’s (1976) statistical model of the
distribution of phenomena where success breeds success.
However, as elaborated in previous sections, this paper
does not advocate or even suggest the sole use of impact-
based filters to search for literature. The utility of this
approach is mainly for literature search topics that generally
return an unmanageable number of hits, and only after
content-relevant filters have been implemented. In other
words, if an initial literature search returns only a reason-
able number of results, traditional methods of evaluating
the literature from the initial search should be used.
In this paper’s simulated preliminary search for a litera-
ture review, initial hits resulted in numbers that could
easily appear overwhelming, especially considering the
search terms of ‘logistic regression’ AND ‘differential item
functioning’ are already quite specialized. It is easy to
imagine that for more general search topics, initial hits
could very well run into the high thousands. Indeed, a
quick search using the same procedure but with less spe-
cialized queries was conducted as comparison. Searching
for ‘bilingual education’ with a time-based filter of publi-
cations between 2006 and 2011 returned 3254 hits — an
unmanageable number without further filtering. An impact-
based search with a wider time frame (2001–2011) was
conducted for comparison. A less conservative threshold of
10 citations (less conservative compared to the original 50
citations) returned only 586 hits. Using the second threshold
of two citations per year (again, less conservative vs the orig-
inal five per year) further reduced the returned hits to just
420. This particular result shows that far more
manageable numbers can be obtained even when the thresholds
are loosened and the time frame for the search is expanded.
Figure 4. Raw Publish or Perish results by publication year.
Figure 5. Publish or Perish results filtered by number of
citations and citations per year.
Note: Total numbers based on 50 cites or 5 cites/year.
Figure 6. SuperSearch results filtered by 15-year date range.
Initial hits for both search approaches presented here
were comparable, but when the filters were applied, the
impact-based approach resulted in a larger reduction of
initial hits. This reduction on its own will not be significant
unless the filtered results are actually relevant to the search
objectives. Thus, a filter based on impact offers inherently
greater efficiency in filtering search results compared
with filters based on date ranges alone. Before the
advent of Google Scholar, researchers did not have a read-
ily available and free source of citation metrics, and thus it
is understandable if search filters used date ranges to refine
the results. Newer technologies now allow us to use more
relevant filters and this paper aims to present an impact-
based search filter that, using the Publish or Perish tool,
allows researchers to filter unmanageable initial search
results just as easily as using more traditional filtering
approaches.
Future research in this area may prove to be exciting as
the scope of literature databases becomes increasingly
comprehensive and, hopefully, more open. As Google and
other search engines index more non-traditional literature
sources, the definition of ‘impact’ could also change dra-
matically as search indices could include not just published
sources but citations that appear in purely web-based plat-
forms, for example. More comprehensive studies comparing
automated searches with human searches in terms of
result relevance would also be very useful, especially if the
comparison metric takes into account search efficiency (i.e.
the balance between the quality of the final search selection
and the resources consumed in undertaking the search). It is
hoped that this article not only provides a useful alternative
to traditional literature search methods but also elicits
interest among researchers to further explore this area of
citation analysis in the context of automated search.
Funding
This research received no specific grant from any funding agency
in the public, commercial or not-for-profit sectors.
Notes
1. Thomson Reuters, the service provider that manages Web of
Science under Web of Knowledge, puts the latest figure for
WoS at over 49.4 million records as of 2011 (Thomson Reuters,
2012).
2. PubMed is free, but it is more focused on biomedical fields
and, in terms of content, it covers the fewest journals
among the four databases mentioned (Falagas et al.,
2008; Harzing and Wal, 2008).
3. The software can be downloaded free at http://www.harzing.
com/pop.htm
4. The most recent version is downloadable from: http://thomsonreuters.com/products_services/science/science_products/a-z/histcite/
5. Because HistCite uses search results from WoS, one needs to
have access to the WoS site.
6. The threshold values are arbitrary, as the searches were
performed merely to illustrate the method rather than as an
actual literature search. Actual thresholds would be defined
by the researcher based on their requirements and logistical
considerations (e.g. narrower or broader depending on their
capacity).
7. Publish or Perish computes average cites per year as the total
number of cites divided by the number of years since publication.
Thus, the threshold means an average of five (or more) cites per year.
References
Cawkell AE (1968) Citation practices. Journal of Documentation
24 (Dec): 299–302.
Cooper HM (1988) Organizing knowledge synthesis: A taxonomy
of literature reviews. Knowledge in Society 1(1): 104–126.
Craig ID, Plume AM, McVeigh ME, et al. (2007) Do open access
articles have greater citation impact? Journal of Informetrics
1(3): 239–248.
Falagas ME, Pitsouni EI, Malietzis GA, et al. (2008) Com-
parison of PubMed, Scopus, Web of Science, and Google
Scholar: Strengths and weaknesses. FASEB Journal 22(2):
338–342.
Garfield E (1970) Citation indexing for studying science. Nature
227: 669–671.
Garfield E (2001) From computational linguistics to algorithmic
historiography. University of Pittsburgh Lazerow Lecture.
Available at: http://garfield.library.upenn.edu/papers/pittsburgh92001.pdf
(accessed 7 April 2011).
Garfield E (2004) Historiographic mapping of knowledge domains
literature. Journal of Information Science 30(2): 119–145.
Garfield E (2010) Historiograph compilation HistCite guide.
Available at: http://www.garfield.library.upenn.edu/histcomp/
guide.html (accessed 7 April 2011).
Garfield E and Pudovkin AI (2004) The HistCite System
for mapping and bibliometric analysis of the output of
searches using the ISI Web of Knowledge. Paper presented
at the Annual meeting of ASIS&T, 12–17 November, 2011,
Newport, RI, USA.
Garfield E, Pudovkin AI and Istomin VS (2003) Mapping out-
put of topical searches in the Science Citation Index, Social
Sciences Citation Index, Arts and Humanities Citation Index.
Paper presented at the Special Libraries Association meeting,
10 June 2003, New York, USA.
Google Scholar (2011) Inclusion guidelines for webmasters. Avail-
able at: http://scholar.google.com.au/intl/en/scholar/inclusion.
html (accessed 9 April 2011).
Gross PLK and Gross EM (1927) College libraries and chemical
education. Science 66: 385–389.
Harzing AW (2011) Publish or Perish (Version 3.1.4): Tarma
Software Research Pty Ltd.
Harzing AW and Wal R van der (2008) Google Scholar as a new
source for citation analysis? Ethics in Science and Environmen-
tal Politics 8(1): 62–71.
Hemingway P and Brereton N (2009) What is a Systematic
Review? London: Hayward Medical Communications.
Jacso P (2005) Google Scholar: The pros and the cons. Online
Information Review 29(2): 208–214.
Levine-Clark M and Gil E (2009) A comparative citation analysis
of Web of Science, Scopus, and Google Scholar. Journal of
Business & Finance Librarianship 14(1): 32–46.
Meho LI and Yang K (2007) A new era in citation and bibliomet-
ric analyses: Web of Science, Scopus, and Google Scholar.
Journal of the American Society for Information Science and
Technology 58(13): 2105–2125.
Merton R (1988) The Matthew effect in science. Science
159(3810): 56–63.
Noruzi A (2005) Google Scholar: The new generation of citation
indexes. Libri 55(4): 170–180.
Pauly D and Stergiou KI (2005) Equivalence of results from two
citation analyses: Thomson ISI’s Citation Index and Google
Scholar’s service. Ethics in Science and Environmental
Politics 5: 33–35.
Price DDS (1976) A general theory of bibliometric and other
cumulative advantage processes. Journal of the American
Society for Information Science 27(5): 292–306.
Pudovkin A and Garfield E (2009) Percentile rank and author
superiority indexes for evaluating individual journal articles
and the author's overall citation performance. Paper pre-
sented at the Fifth international conference on webometrics,
informetrics and scientometrics, 13–16 June, 2009, Dalian,
China.
Radicchi F, Fortunato S and Castellano C (2008) Universality of
citation distributions: Toward an objective measure of scientific
impact. Proceedings of the National Academy of Sciences
105(4): 17268–17272.
Randolph JJ (2009) A guide to writing the dissertation literature
review. Practical Assessment, Research & Evaluation 14(13):
1–13. Available at: http://pareonline.net/getvn.asp?v=14&n=13
(accessed 30 April 2012).
Schroeder R (2007) Pointing users toward citation searching:
Using Google Scholar and Web of Science. Portal: Libraries
and the Academy 7(2): 243–248.
Thomson Reuters (2012) Web of Knowledge: Quality and quantity.
Available at: http://wokinfo.com/realfacts/qualityandquantity/
(accessed 23 February 2012).
Zitt M, Ramanana-Rahary S and Bassecoulard E (2005) Relativity
of citation performance and excellence measures: From cross-
field to cross-scale effects of field-normalisation. Scientometrics
63(2): 373–401.
Author biography
Alvin Vista is currently pursuing a PhD in Educational Measurement
at the University of Melbourne. He obtained his MA in Educational
Psychology from the University of Georgia as a Fulbright scholar
from the Philippines. He is currently professionally engaged as
part of a research team at the Assessment Research Centre,
University of Melbourne.