Tests
Carlos Matos
Lecture 8
PC3001, PC4001, PC5001, CS2847
Goals of evaluation
Assess extent of system functionality
Assess effect of an interface on users
Identify specific problems
Evaluation techniques
Evaluation
• tests usability and functionality of system
• occurs in laboratory, field and/or in collaboration
with users
• evaluates both design and implementation
• should be considered at all stages in the design life
cycle
Evaluation: tests
Evaluating designs
Evaluating designs
Cognitive walkthrough (task-specific)
Heuristic evaluation (holistic)
Review-based evaluation (holistic)
Cognitive walkthrough (task-specific)
Based on the concept that users prefer to learn
by doing rather than by reading manuals
Evaluates design on how well it supports users
in learning specific tasks
Performed by a group that includes the UI
designers and developers, led by experts in
cognitive psychology
The group ‘walks through’ the design to identify
potential problems using psychological
principles
Forms are used to guide the analysis on each
step of the tasks
Heuristic evaluation (holistic)
Proposed by Nielsen and Molich
Compares the UI design against accepted usability principles
Identify usability criteria (heuristics)
Experts check that the design meets the
criteria
Example heuristics
• Consistency and standards
• Match between system and the real world
• Error prevention
Heuristic evaluation ‘debugs’ design
Review-based evaluation (holistic)
Expert-based evaluation method
Draws on experimental results and empirical
evidence from the literature to support or
refute aspects of the design
Care needed to ensure results are
transferable to new design
Model-based evaluation
Cognitive models used to filter design
options
• e.g. GOMS1 prediction of user performance
Design rationale can also provide useful
evaluation information
1GOMS
"a set of Goals, a set of Operators, a set of Methods for achieving the goals, and a set of Selection rules for choosing among competing methods for goals."
Evaluation: tests
Evaluating implementations
Empirical methods in HCI
Lab experiment
• Artificial, highly controlled by experimenter
Field study
• Occurs in the actual environment people use the UI
and with real tasks
Survey
• Questionnaire, conducted by paper, phone, web, or
in person
Quantifying usability
Learnability
• Easy (quick) to learn?
Efficiency
• Fast to use after learning?
Errors
• Number of errors
Satisfaction
• Degree of satisfaction reported by users
Controlled experiment
Controlled evaluation of specific aspects of
interactive behaviour
Evaluator chooses hypothesis to be tested
Manipulate independent variables
• Different placement, font size, input
Measure dependent variables
• Times, #errors, #tasks done, satisfaction
Use statistical analysis to accept or reject the
hypothesis
• How changes in independent variables affect the
dependent variables – are those effects significant?
Designing the experiment
Process: Y = f(x) + ε
• x: independent variables
• Y: dependent variables
• ε: unknown/uncontrolled variables
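To make the process model concrete, here is a minimal simulation sketch with invented numbers: the dependent variable (a hypothetical reading time) is generated from the independent variable (font size) plus noise standing in for unknown/uncontrolled variables.

```python
import random

# Minimal simulation of Y = f(x) + ε, with made-up numbers:
# reading time (Y, seconds) as a function of font size (x, points).

def f(font_size_pt):
    # Hypothetical systematic effect: smaller fonts take longer to read
    return 30.0 - font_size_pt

def run_trial(font_size_pt):
    noise = random.gauss(0.0, 2.0)   # ε: uncontrolled variation
    return f(font_size_pt) + noise   # Y: observed dependent variable

for x in (10, 12, 14):
    print(x, round(run_trial(x), 1))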
Designing the experiment
Subjects/users: who – representative, sufficient
sample
Implementation: real environment, artificial
variations
Tasks
• Real tasks: word processing, e-mail, web browsing
• Artificial: users focus on a simple subset of tasks
Measuring: how to count time, #clicks, #errors
Ordering: of conditions and tasks
Hardware: physical conditions of the test,
available inputs
Hypothesis
Prediction of outcome
• Framed in terms of independent and dependent
variables
e.g. “error rate will increase as font size decreases”
Null hypothesis
• States no difference between conditions
• The aim is to disprove this
e.g. null hypothesis = “no change of error rate with
font size”
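As an illustration of testing the example hypothesis, here is a sketch using SciPy with invented data; a Spearman correlation tests for a monotonic relationship between font size and error rate.

```python
from scipy.stats import spearmanr

# Invented data for the example hypothesis "error rate will
# increase as font size decreases"
font_sizes = [8, 10, 12, 14, 16, 18]                # independent variable (pt)
error_rates = [0.21, 0.15, 0.12, 0.08, 0.07, 0.05]  # dependent variable

rho, p = spearmanr(font_sizes, error_rates)
# A significantly negative rho (small p) would reject the null
# hypothesis "no change of error rate with font size"
print(f"rho = {rho:.2f}, p = {p:.4f}")
```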
A/B Testing
Experiment based on two alternative
interfaces
• Normally A is the control and B is the variation
In Web design, this is normally used to identify
improvements that can maximise a certain
outcome of interest
Normally the current version of the interface
is associated with the null hypothesis
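A sketch of how such an A/B comparison is often analysed, with hypothetical click counts; a chi-squared test checks whether outcome frequencies differ between the two interfaces.

```python
from scipy.stats import chi2_contingency

# Hypothetical outcome counts for control (A) and variation (B)
#         clicked, did not click
table = [[120, 880],   # A: current interface (null hypothesis)
         [150, 850]]   # B: candidate improvement

chi2, p, dof, expected = chi2_contingency(table)
# A small p suggests the observed difference between A and B is
# unlikely if the two interfaces actually perform the same
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```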
Concerns
Internal validity
Are observed results actually caused by the
independent variables?
External validity
Can observed results be generalised to the world
outside the lab?
Reliability
Will consistent results be obtained by repeating
the experiment?
Threats to Internal Validity
Ordering effects
• People learn, and people get tired
• Randomise or counterbalance ordering
Selection effects
• Avoid pre-existing groups (unless the group is an
independent variable)
• Randomly assign users to independent variables
Experimenter bias
• Experimenters may prefer a hypothesis to be
confirmed
• Double-blind experiments are quite hard in HCI
• Control the protocol
Threats to External Validity
Population
• Draw a random sample from the real target
population
Ecological
• Make lab conditions as realistic as possible
Training
• Training should mimic how the real interface would
be encountered and learned
Task
• Tasks for testing should be based on task analysis
Threats to Reliability
Uncontrolled variation
• User differences
• Task design
• Measurement error
Solutions
• Eliminate uncontrolled variation
Select users by experience
Give consistent training
Measure dependent variables precisely
• Repetition, repetition
Many users, many trials
Standard deviation of the mean shrinks as 1/√N
(i.e. quadrupling #users makes the mean estimate
twice as accurate)
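A quick numerical check of the 1/√N claim, using simulated task times (invented distribution): the standard error of the mean roughly halves when the sample is quadrupled.

```python
import random
import statistics

random.seed(1)
# Simulated population of task times: mean 10 s, standard deviation 2 s
population = [random.gauss(10.0, 2.0) for _ in range(100_000)]

def stderr_of_mean(n):
    """Standard error of the mean for a random sample of size n."""
    sample = random.sample(population, n)
    return statistics.stdev(sample) / n ** 0.5

print(stderr_of_mean(25))   # ~2/5  = 0.40
print(stderr_of_mean(100))  # ~2/10 = 0.20 (quadrupling N halves it)
```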
Blocking
Divide samples into subsets that are more
homogeneous than the whole set
• Example: testing wear rate of different shoe sole
material
Lots of variation between feet of different people, but
the feet on the same person are more homogeneous
Apply all conditions within each block
• Test material A on one foot, material B on the other
Measure difference within block
• Wear(A) – Wear(B)
Randomise within the block to eliminate validity
threats
• Randomly put A on left or right foot
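A sketch of the shoe example with invented wear measurements: each person is one block, material A's foot is randomised within the block, and the analysis uses the within-block difference so that the large person-to-person variation cancels.

```python
import random
import statistics

# Each block = one person; both sole materials are tested within the block.
# All wear numbers below are invented for illustration.
differences = []
for person in ["p1", "p2", "p3", "p4", "p5"]:
    a_foot = random.choice(["left", "right"])  # randomise A's foot in the block
    base = random.uniform(5.0, 9.0)            # large person-level variation
    wear_a = base + random.gauss(0.0, 0.1)     # stand-in measurement, material A
    wear_b = base + 0.5 + random.gauss(0.0, 0.1)  # B wears ~0.5 units more here
    differences.append(wear_a - wear_b)        # Wear(A) - Wear(B) within block

# The person-level variation (base) cancels in the within-block difference
print(statistics.mean(differences))            # close to -0.5
```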
Between-subjects experiment
Each subject performs the experiment under
only one condition
Results are compared between different
groups
• Is mean(xi) > mean(yj)?
No transfer of learning
More users required
Variation can bias results
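A sketch of the between-groups comparison with SciPy, using invented completion times; an independent-samples t-test compares mean(xi) against mean(yj).

```python
from scipy.stats import ttest_ind

# Invented completion times (s): each user saw only one condition
group_x = [12.1, 13.4, 11.8, 14.0, 12.9, 13.7]  # users under condition X
group_y = [14.8, 15.2, 13.9, 16.1, 15.0, 14.4]  # different users, condition Y

t, p = ttest_ind(group_x, group_y)
print(f"t = {t:.2f}, p = {p:.4f}")  # small p: group means likely differ
```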
Within-subjects experiment
Each subject performs the experiment under
each condition
Results are compared within each user
• For user i compute xi – yi
• Is mean(xi – yi) > 0?
Transfer of learning possible
Less costly and less likely to suffer from user variation
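The matching within-subjects analysis, again with invented times: compute xi – yi per user and test whether the mean difference departs from zero.

```python
from scipy.stats import ttest_1samp

# Invented times (s): each user tried both conditions
x = [12.1, 13.4, 11.8, 14.0, 12.9]  # condition X, user i
y = [14.8, 15.2, 13.9, 16.1, 15.0]  # condition Y, same user i

diffs = [xi - yi for xi, yi in zip(x, y)]
t, p = ttest_1samp(diffs, 0.0)      # is mean(xi - yi) different from 0?
print(f"mean diff = {sum(diffs) / len(diffs):.2f}, p = {p:.4f}")
```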
Counterbalancing
Defeats ordering effects by varying order of conditions
systematically (not randomly)
Latin Square designs
• Randomly assign subjects to equal-size groups
• A, B, C, … are the experimental conditions
• Latin Square ensures that each condition occurs in every
position in the ordering for an equal number of users
• Balanced Latin Squares: http://www.yorku.ca/mack/RN-Counterbalancing.html
2 conditions (columns = groups, rows = order of presentation):
G1 G2
A  B
B  A

3 conditions:
G1 G2 G3
A  C  B
B  A  C
C  B  A

4 conditions:
G1 G2 G3 G4
A  D  C  B
B  A  D  C
C  B  A  D
D  C  B  A
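A sketch of one common construction for a balanced Latin square (in the spirit of the MacKenzie page linked above; verify the details against that source): the first row interleaves conditions from both ends of the sequence, each subsequent row shifts it by one, and for an odd number of conditions the reversed rows are appended to restore balance.

```python
def balanced_latin_square(n):
    """Orderings where each condition appears in each position equally
    often and (for even n) follows every other condition equally often."""
    # First row: 0, 1, n-1, 2, n-2, ... alternates ends of the sequence
    first, lo, hi = [0], 1, n - 1
    take_lo = True
    while len(first) < n:
        first.append(lo if take_lo else hi)
        lo, hi = (lo + 1, hi) if take_lo else (lo, hi - 1)
        take_lo = not take_lo
    rows = [[(c + i) % n for c in first] for i in range(n)]
    if n % 2 == 1:  # odd n: append reversed rows to restore balance
        rows += [list(reversed(r)) for r in rows]
    return rows

# One row per group; letters are the experimental conditions A, B, C, ...
for row in balanced_latin_square(4):
    print(" ".join("ABCD"[c] for c in row))
```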
Kinds of measures
Self-report
• E.g. satisfaction
Observation
• Visible vs. hidden observer
• Hawthorne effect1
Archival records
• Public vs. private
Trace
• Subjects normally unaware (e.g. testing for ‘read
wear’ on a book)
1Hawthorne effect: the alteration of behaviour by the subjects of a study due to their awareness of being observed.
Evaluation: tests
Query techniques
Interviews
Analyst questions the user on a one-to-one basis,
usually based on prepared questions
Informal, subjective and relatively cheap
Advantages
• Can be varied to suit context
• Issues can be explored more fully
• Can elicit user views and identify unanticipated
problems
Disadvantages
• Very subjective
• Time consuming
Questionnaires
Set of fixed questions given to users
Advantages
• Quick and reaches large user group
• Can be analysed more rigorously
Disadvantages
• Less flexible
• Less probing
Questionnaires
Need careful design
• What information is required?
• How are answers to be analysed?
Styles of question
• General
• Open-ended
• Scalar
• Multi-choice
• Ranked
Evaluation: tests
Physiological methods
Eye tracking
Head- or desk-mounted equipment tracks eye position
Eye movement reflects the amount of
cognitive processing a display requires;
measurements include
• fixations: eye maintains stable position. Number
and duration indicate level of difficulty with display
• saccades: rapid eye movement between points of
interest
• scan paths: moving straight to a target with a short
fixation at the target is optimal
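As an illustration of how fixations are typically extracted from raw gaze samples, here is a minimal dispersion-threshold sketch (in the spirit of the well-known I-DT algorithm; the threshold values are invented, and real eye trackers ship calibrated implementations).

```python
def dispersion(points):
    """Spread of a set of (x, y) gaze points: x-range plus y-range."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def detect_fixations(samples, max_dispersion=25.0, min_samples=6):
    """Return (start, end) index pairs where gaze stays tightly clustered.

    samples: list of (x, y) gaze positions at a fixed sampling rate.
    Everything between fixations is treated as saccades.
    """
    fixations, i = [], 0
    while i + min_samples <= len(samples):
        j = i + min_samples
        if dispersion(samples[i:j]) <= max_dispersion:
            # Grow the window while the points stay within the threshold
            while j < len(samples) and dispersion(samples[i:j + 1]) <= max_dispersion:
                j += 1
            fixations.append((i, j - 1))
            i = j
        else:
            i += 1
    return fixations
```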
Physiological measurements
Emotional response linked to physical changes
These may help determine a user’s reaction
to an interface; measurements include:
• heart activity, including blood pressure, volume and pulse.
• activity of sweat glands: Galvanic Skin Response (GSR)
• electrical activity in muscle: electromyogram (EMG)
• electrical activity in brain: electroencephalogram (EEG)
Some difficulty in interpreting these
physiological responses; more research is needed
Evaluation: tests
Applicability
Choosing an evaluation method
Question                Decision
When in process         design vs. implementation
Style of evaluation     laboratory vs. field
Level of objectivity    subjective vs. objective
Type of measures        qualitative vs. quantitative
Level of information    high level vs. low level
Level of interference   obtrusive vs. unobtrusive
Available resources     time, subjects, equipment, expertise