Introduction to Systematic Literature Review method
Systematic Literature Review
Norsaremah Salleh, PhD.
Department of Computer Science, IIUM
17th & 18th June 2014
Motivations for SLR
Steps in conducting SLR
What is a literature review (LR)?
• LR is a critical and in-depth evaluation of previous research.
• A summary and synopsis of a particular area of research, allowing anybody reading the paper to establish the reasons for pursuing that research.
• “An information analysis and synthesis, focusing on findings and not simply bibliographic citations, summarizing the substance of a literature and drawing conclusions from it” (Educational Resources Information Center).
• A good LR evaluates the quality & findings of previous research.
Why do an LR?
• To establish a connection or relationship between “existing knowledge” and “the problem to be solved”
• To refine the problem
• To identify significance of research
• To define research question(s)
Types of Literature Review
• Traditional Review (narrative)
• Systematic Literature Review or Systematic Review
• Systematic Mapping Study (Scoping Study)
• Tertiary Study
“Traditional” literature review
• Provides an overview of the research findings
on particular topics.
• Pros: can produce insightful, valid syntheses of the research literature if conducted by an expert.
• Cons: vulnerable to unintentional and
intentional bias in the selection, interpretation
and organization of content.
Systematic Literature Review (SLR)
What is a SLR?
◦ A process of identifying, assessing, and interpreting all
available research evidence, to provide answers for a
particular research question.
◦ A form of secondary study that uses a well-defined
methodology (Kitchenham et al., 2007).
SLRs are well established in other disciplines, particularly medicine. They integrate individual clinical expertise and facilitate access to the outcomes of the research (Budgen et al., 2006).
“Traditional” Review Vs SLR
“the purpose of a SLR is to provide as complete a list as
possible of all the published and unpublished studies
relating to a particular subject area. While traditional
reviews attempt to summarize results of a number of
studies, systematic reviews use explicit and rigorous
criteria to identify, critically evaluate and synthesize
all the literature on a particular topic”.
(Cronin et al., 2008)
Systematic Mapping Study
• Suitable for a very broad topic
• Identify clusters of evidence
• Direct the focus of future SLRs
• To identify areas for future primary studies
Tertiary Study
• Is an SLR of SLRs
• To answer a wider question
• Uses the same method as an SLR
• Potentially less resource intensive
Hierarchy of Review
• A tertiary study (e.g. an SLR of SLRs in SE) sits above secondary studies (e.g. an SLR of pair programming studies), which in turn synthesize primary studies.
Motivations for using SLR
To systematically accumulate, organize, evaluate, and
synthesize all existing research evidence of your research area.
To present fair evaluation of a research topic by using a
trustworthy, rigorous, and auditable methodology.
To produce reliable and unbiased results.
To identify gaps in the existing research that will lead to topics
for further investigation.
To provide a background against which to position new research activities.
To support Evidence-based research.
Steps in conducting SLR
• Formulate the review’s research question
• Develop the review’s protocol
• Search the relevant literature
• Perform selection of primary studies
• Perform data extraction
• Assess studies’ quality
• Conduct synthesis of evidence
• Write up the SLR report/paper
Guidelines for SLR
• Kitchenham, B. & Charters, S. (2007). Procedures for performing systematic literature reviews in software engineering. UK: Keele University and University of Durham.
• Petticrew, M., & Roberts, H. (2006). Systematic Reviews in the Social Sciences: A Practical Guide. Blackwell Publishing.
• Systematic Reviews: CRD’s guidance for undertaking reviews in health care (2008). Centre for Reviews and Dissemination, University of York.
1) Formulation of RQ
2) Protocol development
The Research Question (RQ)
• Is the most important part in any SLR
• Is not necessarily the same as question(s)
addressed in your research.
• Is used to guide the search process
• Is used to guide the extraction process
• Data analysis (synthesis of evidence) is expected
to answer your SLR’s RQ.
Formulation of RQ
• Features of good question:
– The RQ is meaningful and important to practitioners and researchers
– The RQ will lead to changes in current practice or to increased confidence in the value of current practice
– The RQ will identify discrepancies between commonly
held beliefs and the reality.
• RQ can be derived primarily based on the researcher’s interest. E.g. an SLR for a PhD thesis should identify the existing basis for the research work and where it fits in the current body of knowledge.
Formulation of RQ
Petticrew & Roberts (2006) suggest that the formulation of RQs
about effectiveness of a treatment should focus on 5 elements
known as PICOC:
Population (P) - the target group for the investigation (e.g.
people, software etc.).
Intervention (I) - specifies the investigation aspects or issues of
interest to the researcher(s).
Comparison (C) – aspect of the investigation with which the intervention is being compared.
Outcomes (O)– the effect of the intervention
Context (C)– the setting or environment of the investigation
Example of PICOC (Kitchenham et al., 2007)
Population: Software or Web project
Intervention: Cross-company project effort estimation model
Comparison: Single-company project effort estimation model
Outcomes: Prediction or estimate accuracy
Title: “Cross versus Within-Company Cost Estimation Studies: A Systematic Review”
Example of PICOC (Salleh et al., 2011)
Population: CS/SE students in higher education
Intervention: Pair programming
Outcomes: PP’s effectiveness
Context: Review(s) of all empirical studies of PP within the domain of
CS/SE in higher education.
Title: “Empirical Studies of Pair Programming for CS/SE Teaching in Higher Education: A Systematic Literature Review”
Example of RQs
Kitchenham et al. (2007):
Q1: What evidence is there that cross-company estimation
models are not significantly different from within-company
estimation models for predicting effort for software/Web projects?
Q2: What characteristics of the study data sets and the data
analysis methods used in the study affect the outcome of within-
and cross-company effort estimation accuracy studies?
Q3: Which experimental procedure is most appropriate for
studies comparing within- and cross-company estimation models?
Example of RQs (cont.)
Major et al. (2011) Systematic Literature Review: Teaching Novices
Programming Using Robots
RQ1: What computer languages are being taught in introductory
programming courses that make use of robots as teaching tools?
RQ2: Are the robots that are being used simulated or physical (real-life)?
RQ3: What are the characteristics (i.e. what is the age, level of education
etc.) of the novices being taught?
RQ4: What types of studies are being performed by researchers that
investigate the teaching of introductory programming concepts using robots?
RQ5: What is the scale (e.g. number of participants) of studies that are being
performed by researchers?
RQ6: Do collected studies suggest that using robotics to teach introductory
programming is effective?
Examples of RQ (cont.)
Salleh et al. (2011)
What evidence is there of PP studies conducted in higher education settings that investigated PP’s effectiveness and/or pair compatibility?
Q1: What evidence is there regarding pair compatibility factors that affect
the effectiveness of PP as a CS/SE pedagogical tool and which pairing
configurations are considered as most effective?
Q2: How was PP’s effectiveness measured in PP studies and how effective
has PP been when used within higher education settings?
Q3: How was quality measured in the studies that used software quality as
a measure of PP’s effectiveness?
Examples of RQ (cont.)
Davis et al. (2006)
• Primary RQ: “What elicitation technique is most
efficient in a particular setting?”
The Review Protocol
• A plan that specifies the basic review procedures
• Components of a protocol:
– Background, RQ, search terms, selection criteria,
quality checklist and procedures, data extraction
strategy, data synthesis strategy, project schedule.
• See example of SLR protocol.
1) Search relevant studies
2) Select relevant studies
3) Assess studies’ quality
4) Perform data extraction
5) Perform synthesis of evidence
1) Searching Relevant Studies
• Involves a comprehensive and exhaustive searching of studies
to be included in the review.
• Define a search strategy
• Consult the subject librarian(s)
• Search strategies are usually iterative and benefit from:
– Preliminary searches (to identify existing reviews and the volume of studies)
– Trial searches (combinations of terms from the RQ)
– Checking the search results against a list of known studies
– Consulting experts in the field
1) Searching Relevant Studies (Zhang et al.)
1. Which approach is to be used in the search process (e.g., manual or automated search)?
2. Where (venues or databases) to search, and which part of an article (field) should be searched?
3. What (subject, evidence type) is to be searched, and what queries (search strings) are fed into the search engines?
4. When is the search carried out, and what time span is to be searched?
Common approach to construct a search string
Derive major terms used in the review questions based on the PICOC.
List the keywords mentioned in relevant articles.
Search for synonyms and alternative words.
Use the Boolean OR to incorporate alternative words and synonyms.
Use the Boolean AND to link the major terms.
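The steps above can be sketched in code. This is a hypothetical illustration (the helper names and term lists are mine, not part of any cited guideline): each PICOC element becomes an OR-group of synonyms, and the groups are linked with AND.

```python
# Hypothetical sketch: building a Boolean search string from PICOC term
# lists. The term lists below are illustrative, not a prescribed vocabulary.

def or_group(terms):
    """Join alternative words/synonyms with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

def build_query(*term_groups):
    """Link the major terms (one OR-group per PICOC element) with AND."""
    return " AND ".join(or_group(g) for g in term_groups)

population = ["student*", "undergraduate*"]
intervention = ["pair programming", "pair-programming"]
outcome = ["effective*", "efficient", "successful"]

query = build_query(population, intervention, outcome)
print(query)
# (student* OR undergraduate*) AND ("pair programming" OR pair-programming) AND (effective* OR efficient OR successful)
```

Such a generated string would still be pasted into (and adapted for) each digital library, since the libraries differ in their query syntax.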
E.g. Search String
Salleh et al. (2011)
• The complete search term initially used:
(student* OR undergraduate*) AND (pair programming OR pair-programming) AND ((experiment* OR measurement OR evaluation OR assessment) AND (effective* OR efficient OR successful))
• A very limited number of results was retrieved when using the complete string, thus a much simpler string was derived.
• The subject librarian suggested revising the search string to:
“pair programming” OR “pair-programming”
E.g. Search String
Kitchenham et al. (2007) used their structured questions to
construct search strings for use with electronic databases:
Population: software OR application OR product OR Web OR WWW
OR Internet OR World-Wide Web OR project OR development
Intervention: cross company OR cross organisation OR cross
organization OR multiple-organizational OR multiple-organisational
model OR modeling OR modelling effort OR cost OR resource
estimation OR prediction OR assessment
Contrast: within-organisation OR within-organization OR within-organizational OR within-organisational OR single company OR single organisation
Outcome: Accuracy OR Mean Magnitude Relative Error
The search strings were constructed by linking the four OR lists using
the Boolean AND.
E.g. Search String
• In Kitchenham et al. (2007), the search string produced
was extremely long
• They attempted to produce a search string that was
very specific to their RQ but ended up getting a large
number of false positives.
• “A simpler search string might have been just as effective.”
Sources of Evidence
Reference lists from relevant primary studies and review articles
Journals (including company journals such as the IBM Journal of Research and Development), grey literature (i.e. technical reports, work in progress)
The Internet (Google)
Direct contact with specific researcher(s)
E.g. Sources of Evidence
Salleh et al. (2011)
Online databases used: IEEExplore, ACM Digital Library, Current
Contents, EBSCOhost, ISI Web of Science, INSPEC, ProQuest, Sage Full
text Collection, ScienceDirect, SpringerLink, Scopus
Other search engines used: Google Scholar, CiteSeer, Agile Alliance.
Some databases were selected based on previous studies we were aware of.
E.g. Sources of Evidence
Kitchenham et al. (2007)
The search strings were used on 6 digital libraries:
INSPEC, EI Compendex, Science Direct, Web of Science, IEEExplore,
ACM Digital library
Search specific journals and conf. proceedings:
Empirical Software Engineering (J), Information and Software
Technology (J), Software Process Improvement and Practice (J),
Management Science (J), International Software Metrics Symposium
(C), International Conference on Software Engineering (C),
Evaluation and Assessment in Software Engineering (C)
Check references of each relevant article
Using Snowballing technique as a
complement to database searches
• Jalali & Wohlin (2012) recommend applying both backward & forward snowballing to locate articles to be included in an SLR.
• The snowballing approach requires a starting set of papers, which should be based on identifying a set of papers from leading journals in the area.
• Forward snowballing – i.e. identify articles that have cited the articles found in the search
• Backward snowballing – i.e. identify articles in the reference lists of the articles found in the search
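One snowballing iteration over a starting set can be sketched as follows. This is a hypothetical illustration: the `get_references`/`get_citers` lookups stand in for queries against a real citation database (e.g. Google Scholar or Scopus), and relevance screening is reduced to a predicate.

```python
# Illustrative sketch of snowballing: expand a starting set of papers by
# backward snowballing (papers in the reference lists of included papers)
# and forward snowballing (papers that cite the included papers).
# The lookup helpers are hypothetical stand-ins for citation-database queries.

def snowball(start_set, get_references, get_citers, is_relevant):
    included = set(start_set)
    frontier = set(start_set)
    while frontier:
        candidates = set()
        for paper in frontier:
            candidates |= set(get_references(paper))  # backward snowballing
            candidates |= set(get_citers(paper))      # forward snowballing
        # keep only new, relevant papers for the next iteration
        frontier = {p for p in candidates if p not in included and is_relevant(p)}
        included |= frontier
    return included

# toy citation graph: A cites B; C cites A
refs = {"A": ["B"], "B": [], "C": ["A"]}
citers = {"A": ["C"], "B": ["A"], "C": []}
found = snowball({"A"}, refs.__getitem__, citers.__getitem__, lambda p: True)
print(sorted(found))  # ['A', 'B', 'C']
```

In practice each candidate is screened against the inclusion/exclusion criteria before joining the set, which is what the `is_relevant` predicate abstracts.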
Issue of publication bias
What is “publication bias”?
Refers to the problem that positive results are more likely to be published than negative results (Kitchenham & Charters, 2007).
E.g. formal experiments that failed to reject the null hypothesis are considered less interesting.
Publication bias can lead to systematic bias in an SLR.
Strategies to address publication bias:
Scanning grey literature
Scanning conference proceedings
Contacting experts working in the area
• Use a relevant bibliographic package to manage the large number of references
• E.g. EndNote, Zotero, JabRef, Reference Manager
Documenting the Search
• The process of conducting SLR must be transparent and replicable.
• The review should be documented in sufficient detail
• The search should be documented and changes noted.
• Unfiltered search results should be saved for possible reanalysis
Data Source Documentation
Digital Library: name of database, search strategy, date of search, years covered by search
Journal Hand Searches: name of journal, years searched
Conference Proceedings: title of proceedings/name of conference, journal name (if published as part of a journal)
2) Selection of Studies
• Primary studies need to be assessed for their actual relevance to the review.
• Set the criteria for including or excluding studies (decided earlier during protocol development, can be refined later)
• Inclusion & exclusion criteria should be based on the research question(s).
• Selection process should be piloted.
• Study selection is a multistage process.
E.g. Selection of Studies
Kitchenham et al. (2007) used the following selection criteria:
Include: any study that compared predictions of cross-company models with within-company models based on analysis of single-company project data.
Exclude: studies where projects were only collected from a small number of different sources (e.g. 2 or 3 companies), and studies where models derived from a within-company data set were compared with predictions from a general cost estimation model.
E.g. Selection of Studies
Salleh et al. (2011)
• Inclusion criteria:
– to include any empirical studies of PP that involved higher
education students as the population of interest.
• Exclusion criteria:
– Papers presenting unsubstantiated claims made by the
author(s), for which no evidence was available.
– Papers about Agile/XP describing development practices
other than PP, such as test-first programming, refactoring
– Papers that only described tools (software or hardware)
that could support the PP practice.
– Papers not written in English.
– Papers involving students but outside higher education.
3) Assessing studies’ quality
– To provide more detailed Inclusion/Exclusion criteria
– To check whether quality differences provide an
explanation for differences in study results
– As a means of weighting the importance of individual studies when results are being synthesized
– To guide the interpretation of findings and
determine the strength of inferences.
– To guide recommendations for further research.
Assessing studies’ quality
Quality relates to the extent to which the study minimizes bias
and maximizes internal and external validity (Khan et al. 2001) .
Quality Concepts Definition (Kitchenham & Charters, 2007)
Bias (synonym: systematic error) – the tendency to produce results that depart systematically from the ‘true’ results. Unbiased results are internally valid.
Internal validity (synonym: validity) – the extent to which the design and conduct of the study are likely to prevent systematic error. Internal validity is a prerequisite for external validity.
External validity – the extent to which the effects observed in the study are applicable outside of the study.
Assessing studies’ quality
Assessing quality of studies:
◦ Methodology or design of the study
◦ Analysis of studies’ findings.
A quality checklist or instrument needs to be designed to facilitate quality assessment.
Most quality checklists include questions aimed at
assessing the extent to which articles have addressed
bias and validity.
E.g. Study Quality Assessment - Salleh et al. (2011)
1. Was the article refereed? (Yes/No)
2. Were the aim(s) of the study clearly stated? (Yes/No/Partially)
3. Were the study participants or observational units adequately described? For example, students’ programming experience, year of study, etc. (Yes/No/Partially)
4. Was the data collection carried out well? For example, discussion of procedures used for collection, and how the study setting may have influenced the data collected. (Yes/No/Partially)
5. Were potential confounders adequately controlled for in the analysis? (Yes/No/Partially)
6. Were the approach to and formulation of the analysis well conveyed? For example, description of the form of the original data, rationale for choice of method/tool/package. (Yes/No/Partially)
7. Were the findings credible? For example, the study was methodologically explained so that we can trust the findings; findings/conclusions are resonant with other knowledge and experience. (Yes/No/Partially)
E.g. Study Quality Assessment
Kitchenham et al. (2007) constructed a quality questionnaire based
on 5 issues affecting the quality of the study :
1. Is the data analysis process appropriate?
2. Did studies carry out a sensitivity or residual analysis?
3. Were accuracy statistics based on the raw data scale?
4. How good was the study comparison method?
5. The size of the within-company data set (e.g. < 10 projects is considered poor quality)
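Checklist answers like those above can be reduced to a per-study score for weighting or filtering studies. A minimal sketch; the Yes = 1 / Partially = 0.5 / No = 0 weighting is a common convention, not one prescribed by the studies cited in these slides.

```python
# Hypothetical sketch: turning quality-checklist answers into a numeric
# score per study. The weighting scheme is a common convention (assumed
# here), not part of either cited quality instrument.

SCORES = {"Yes": 1.0, "Partially": 0.5, "No": 0.0}

def quality_score(answers):
    """Sum the per-question scores into a single quality score for a study."""
    return sum(SCORES[a] for a in answers)

# one study's answers to a 7-question checklist
answers = ["Yes", "Yes", "Partially", "Yes", "No", "Partially", "Yes"]
print(quality_score(answers))  # 5.0
```

Scores computed this way can then serve as weights during synthesis or as a cut-off for excluding low-quality studies, as the slide's purposes list suggests.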
4) Data Extraction
Involves reading the full-text article.
Data extracted from primary studies should be recorded using data extraction forms.
The form should be designed and piloted when the protocol is defined.
Collect all the information that can be used to answer the RQ and the study’s quality criteria.
Both the quality checklist and review data can be included in the same form.
In case of duplicate publications (reporting the same data), refer to the most complete one.
For validation, a set of papers should be reviewed by 2 or more researchers. Compare results and resolve any conflicts.
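A data extraction form can be kept as one structured record per primary study, which also makes cross-reviewer comparison easy. A minimal sketch; the field names are illustrative, not taken from any cited template.

```python
# Hypothetical sketch of a data extraction form as a CSV record per
# primary study. The field names are illustrative, chosen to cover the
# RQ and the quality criteria as the slides recommend.

import csv

FIELDS = ["study_id", "title", "year", "study_type",
          "population", "intervention", "outcome_measures",
          "quality_score", "extractor", "notes"]

def save_extractions(records, path):
    """Write extracted data to a CSV file, one row per primary study."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)

record = {"study_id": "S01", "title": "Example study", "year": 2009,
          "study_type": "experiment", "population": "CS students",
          "intervention": "pair programming", "outcome_measures": "exam score",
          "quality_score": 5.0, "extractor": "reviewer A", "notes": ""}
save_extractions([record], "extraction.csv")
```

Keeping an `extractor` field per row supports the validation step above: two reviewers' files can be diffed to surface conflicts for discussion.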
Extraction Template (Cruzes et al., 2007)
• Study Type
• Strength of Evidence
5) Synthesis of evidence
Involves collating and summarizing the results of the
included primary studies.
Key objectives of data synthesis (Cruzes & Dyba, 2011):
– to analyze and evaluate multiple studies
– to select appropriate methods for integrating or
providing new interpretive explanations about them
• Types of Synthesis:
– Descriptive (narrative/qualitative, thematic analysis)
– Numerical synthesis (e.g. meta-analysis)
Descriptive (Narrative) Synthesis
– “An approach to the synthesis of findings from multiple studies that relies primarily on the use of words and text to summarise and explain the findings of the synthesis. It adopts a textual approach to the process of synthesis to ‘tell the story’ of the findings from the included studies.” – Popay et al. 2006
• Methods proposed for synthesis of qualitative studies: meta-ethnography, thematic analysis, content analysis, narrative synthesis.
Meta-ethnography
• It is a method that involves induction and interpretation.
• The outcome is the translation of studies into one another, which encourages the researcher to understand and transfer ideas, concepts and metaphors across studies.
• Studies can relate to one another in one of three ways:
– they may be directly comparable as reciprocal translations
– they may stand in opposition to one another as refutational
– taken together they may represent a line of argument.
• Example of an SLR paper that has used meta-ethnography
Numerical Synthesis (Meta-Analysis)
• Meta-analysis can be used to aggregate results or to
pool data from different studies.
• The aim is to resolve uncertainty when the results of
studies disagree; and to increase confidence in the
results obtained from individual studies
• The outcome of a meta-analysis is an average effect
size (ES) with an indication of how variable that effect
size is between studies.
• ES can be calculated based on Hedges’ or Cohen’s
approach; i.e. the standardized mean difference
between the two groups.
• Meta-analysis involves three main steps:
1. Decide which studies are to be included in the meta-analysis.
2. Estimate an effect size for each individual study.
3. Combine the effect sizes from the individual studies
to estimate and test the combined effect.
• Results of the meta-analysis can be presented in a forest plot.
• Software tools: Comprehensive Meta-Analysis (CMA),
MIX, OpenMeta [Analyst], Review Manager (RevMan)
from Cochrane Collaboration, Stata, R.
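The three steps above can be sketched numerically. This is a minimal fixed-effect illustration using Hedges' g with invented example data; a real meta-analysis would use one of the tools listed (e.g. RevMan, or the metafor package in R).

```python
# Illustrative fixed-effect meta-analysis sketch. The study data are
# invented for demonstration only; the variance formula is the standard
# large-sample approximation for a standardized mean difference.

import math

def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference between two groups (Hedges' g),
    including the small-sample correction factor."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    d = (mean_t - mean_c) / pooled_sd
    correction = 1 - 3 / (4 * (n_t + n_c) - 9)  # small-sample correction
    return d * correction

def fixed_effect_pool(effects_and_variances):
    """Step 3: combine per-study effect sizes with inverse-variance weights."""
    weights = [1 / v for _, v in effects_and_variances]
    total = sum(w * g for (g, _), w in zip(effects_and_variances, weights))
    return total / sum(weights)

# Step 1 (study selection) is assumed done; invented data per study:
# (mean_t, mean_c, sd_t, sd_c, n_t, n_c)
studies = [(75, 70, 10, 10, 30, 30), (68, 66, 8, 9, 25, 25)]

# Step 2: estimate an effect size (and its variance) for each study
results = []
for m_t, m_c, s_t, s_c, n_t, n_c in studies:
    g = hedges_g(m_t, m_c, s_t, s_c, n_t, n_c)
    var = (n_t + n_c) / (n_t * n_c) + g**2 / (2 * (n_t + n_c))
    results.append((g, var))

print(round(fixed_effect_pool(results), 2))  # 0.37
```

The pooled value would then be interpreted against effect-size categories such as those of Kampenes et al. cited on the next slide.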
Meta-analysis of PP’s effectiveness (Salleh et al., 2011)
The pooled ES for this MA = 0.67 (medium effect size).
Effect size category (Kampenes et al., 2007):
Small (effect size of 0.000 – 0.376)
Medium (effect size of 0.378 – 1.000)
Large (effect size of 1.002 – 3.40).
REPORTING THE RESULTS
1) Reporting Results in Academic Venues
2) Reporting Structure
3) Examples of Published SLR
Reporting SLR results in academic venues
• Some conferences and journals include a specific
topic/area on SLR.
– Information & Software Technology (IST) has an
editor specializing in systematic reviews.
– Int’l Symposium on Empirical Software Engineering
& Measurement (ESEM)
– IEEE Transactions on Education
Reporting Structure
• Introduction
– General introduction about the research. State the purpose of the review. Emphasize the reason(s) why the RQ is important. State the significance of the review work and how the project contributes to the body of knowledge of the field.
• Main Body
– Review method – briefly describe steps taken to conduct the review
– Results – Findings from the review, i.e. the synthesis
– Discussion – implication of review for research &
practice, Threats to Validity
Other examples of SLR
• Cross versus Within-Company Cost Estimation Studies: A
Systematic Review (Kitchenham et al., 2007)
• A Systematic Review of Theory Use in Software Engineering
Experiments (Hannay et al., 2007)
• Motivations in software engineering: a Systematic Literature
Review (Beecham et al., 2008)
• Effectiveness of Requirements Elicitation Techniques: Empirical
Results Derived from a Systematic Review (Davis et al., 2006)
• A systematic review of software development cost estimation
studies (Jorgensen & Shepperd, 2007)
• The effectiveness of pair programming: A meta-analysis (Hannay
et al., 2009).
Budgen, D., Charters, S., Turner, M., Brereton, P., Kitchenham, B., Linkman, S.
(2006) Investigating the applicability of the evidence-based paradigm to
software engineering, Proc. 2006 Int’l Workshop on interdisciplinary software
engineering research, pp. 7 – 13.
Cruzes, D.S., and Dyba, T. (2011) Research Synthesis in Software Engineering: A
tertiary study, Information and Software Technology, 53(5), pp. 440-455.
Cochrane Collaboration. Cochrane Reviewers’ Handbook. Version 4.2.1.
Cohen, J. Weighted Kappa: nominal scale agreement with provision for scaled
disagreement or partial credit. Psychol Bull, 70 (1968), pp. 213-220.
Davis, A., Dieste, O., Hickey, A., Juristo, N., Moreno, A.M. (2006) Effectiveness of
Requirements Elicitation Techniques: Empirical Results Derived from a
Systematic Review, 14th IEEE Requirements Engineering Conference , pp 176 –
Dyba, T., & Dingsoyr, T. (2008). Strength of evidence in systematic reviews in
software engineering. Proceedings of the 2nd Int'l Symp. Empirical Software
Engineering & Measurement (ESEM 2008), 178-187.
Dyba, T., Kitchenham, B. A., & Jorgensen, M. (2005). Evidence-based
software engineering for practitioners. IEEE Software, 22(1), 58-65.
Jalali, S. & Wohlin, C. (2012). Systematic Literature Studies: Database
Searches vs. Backward Snowballing, Proceedings of the 6th Int'l Symp.
Empirical Software Engineering & Measurement (ESEM 2012).
Kampenes, V. B., Dyba, T., Hannay, J. E., & Sjoberg, D. I. K. (2007). A
systematic review of effect size in software engineering. Information and
Software Technology, 49, 1073-1086.
Khan, K.S., ter Riet, Gerben., Glanville, Julia., Sowden, Amanda, J. and
Kleijnen, Jo. (eds) Undertaking Systematic Review of Research on
Effectiveness. CRD’s Guidance for those Carrying Out or Commissioning
Reviews. CRD Report Number 4 (2nd Edition), NHS Centre for Reviews
and Dissemination, University of York, IBSN 1 900640 20 1, March 2001.
Kitchenham, B. & Charters, S. (2007). Procedures for performing
systematic literature reviews in software engineering. UK: Keele
University and University of Durham.
Kitchenham, B., Mendes, E., Travassos, G.H. (2007) A Systematic Review
of Cross- vs. Within-Company Cost Estimation Studies, IEEE Transactions
on Software Engineering, 33 (5), pp 316-329.
Martyn Shuttleworth (2009). What Is A Literature Review?. Retrieved 29
Jul. 2012 from Experiment Resources: http://www.experiment-
Noblit, G.W. & Hare, R.D. Meta-ethnography: Synthesizing Qualitative
Studies, Sage, 1988.
Petticrew, M., & Roberts, H. (2006). Systematic Reviews in the Social
Sciences: A Practical Guide: Blackwell Publishing.
Pickard, L. M., B. A. J. Kitchenham, et al. (1998). "Combining empirical
results in software engineering." Information and Software Technology
Salleh, N., Mendes, E., & Grundy, J. (2011). Empirical studies of pair
programming for CS/SE teaching in higher education: A systematic
literature review. IEEE Transactions on Software Engineering, 37(4): 509 –