Tests
Carlos Matos
Lecture 8
PC3001, PC4001, PC5001, CS2847
Goals of evaluation
Assess extent of system functionality
Assess effect of an interface on users
Identify specific problems
Evaluation techniques
Evaluation
• tests usability and functionality of system
• occurs in laboratory, field and/or in collaboration
with users
• evaluates both design and implementation
• should be considered at all stages in the design life
cycle
Evaluation: tests
Evaluating designs
Evaluating designs
Cognitive walkthrough (task-specific)
Heuristic evaluation (holistic)
Review-based evaluation (holistic)
Cognitive walkthrough (task-specific)
Based on the concept that users prefer to learn
by doing rather than by reading manuals
Evaluates design on how well it supports users
in learning specific tasks
Performed by a group that includes the UI
designers and developers, led by experts in
cognitive psychology
The group ‘walks through’ the design to identify
potential problems using psychological
principles
Forms are used to guide the analysis on each
step of the tasks
Heuristic evaluation (holistic)
Proposed by Nielsen and Molich
Compares the UI design against accepted usability principles
Identify usability criteria (heuristics)
Experts check that the design meets the
criteria
Example heuristics
• Consistency and standards
• Match between system and the real world
• Error prevention
Heuristic evaluation ‘debugs’ design
Review-based evaluation (holistic)
Expert-based evaluation method
Draws on experimental results and empirical
evidence from the literature to support or
refute aspects of the design
Care needed to ensure results are
transferable to new design
Model-based evaluation
Cognitive models used to filter design
options
• e.g. GOMS1 prediction of user performance
Design rationale can also provide useful
evaluation information
1GOMS
"a set of Goals, a set of Operators, a set of Methods for achieving the goals, and a set of Selection rules for choosing among competing methods for goals."
Evaluation: tests
Evaluating implementations
Empirical methods in HCI
Lab experiment
• Artificial, highly controlled by experimenter
Field study
• Occurs in the actual environment people use the UI
and with real tasks
Survey
• Questionnaire, conducted by paper, phone, web, or
in person
Quantifying usability
Learnability
• Easy (quick) to learn?
Efficiency
• Fast to use after learning?
Errors
• Number of errors
Satisfaction
• Degree of satisfaction reported by users
Controlled experiment
Controlled evaluation of specific aspects of
interactive behaviour
Evaluator chooses hypothesis to be tested
Manipulate independent variables
• Different placement, font size, input
Measure dependent variables
• Times, #errors, #tasks done, satisfaction
Use statistical analysis to accept or reject the
hypothesis
• How changes in independent variables affect the
dependent variables – are those effects significant?
Designing the experiment
Process: Y = f(x) + ε
• x: independent variables
• Y: dependent variables
• ε: unknown/uncontrolled variables
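To make the process model concrete, here is a minimal simulation sketch with invented numbers: the dependent variable (a hypothetical reading time) is generated from the independent variable (font size) plus noise standing in for unknown/uncontrolled variables.

```python
import random

# Minimal simulation of Y = f(x) + ε, with made-up numbers:
# reading time (Y, seconds) as a function of font size (x, points).

def f(font_size_pt):
    # Hypothetical systematic effect: smaller fonts take longer to read
    return 30.0 - font_size_pt

def run_trial(font_size_pt):
    noise = random.gauss(0.0, 2.0)   # ε: uncontrolled variation
    return f(font_size_pt) + noise   # Y: observed dependent variable

for x in (10, 12, 14):
    print(x, round(run_trial(x), 1))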
Designing the experiment
Subjects/users: who – representative, sufficient
sample
Implementation: real environment, artificial
variations
Tasks
• Real tasks: word processing, e-mail, web browsing
• Artificial: users focus on a simple subset of tasks
Measuring: how to count time, #clicks, #errors
Ordering: of conditions and tasks
Hardware: physical conditions of the test,
available inputs
Hypothesis
Prediction of outcome
• Framed in terms of independent and dependent
variables
e.g. “error rate will increase as font size decreases”
Null hypothesis
• States no difference between conditions
• The aim is to disprove this
e.g. null hypothesis = “no change of error rate with
font size”
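As an illustration of testing the example hypothesis, here is a sketch using SciPy with invented data; a Spearman correlation tests for a monotonic relationship between font size and error rate.

```python
from scipy.stats import spearmanr

# Invented data for the example hypothesis "error rate will
# increase as font size decreases"
font_sizes = [8, 10, 12, 14, 16, 18]                # independent variable (pt)
error_rates = [0.21, 0.15, 0.12, 0.08, 0.07, 0.05]  # dependent variable

rho, p = spearmanr(font_sizes, error_rates)
# A significantly negative rho (small p) would reject the null
# hypothesis "no change of error rate with font size"
print(f"rho = {rho:.2f}, p = {p:.4f}")
```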
A/B Testing
Experiment based on two alternative
interfaces
• Normally A is the control and B is the variation
In Web design, this is normally used to identify
improvements that can maximise a certain
outcome of interest
Normally the current version of the interface
is associated with the null hypothesis
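A sketch of how such an A/B comparison is often analysed, with hypothetical click counts; a chi-squared test checks whether outcome frequencies differ between the two interfaces.

```python
from scipy.stats import chi2_contingency

# Hypothetical outcome counts for control (A) and variation (B)
#         clicked, did not click
table = [[120, 880],   # A: current interface (null hypothesis)
         [150, 850]]   # B: candidate improvement

chi2, p, dof, expected = chi2_contingency(table)
# A small p suggests the observed difference between A and B is
# unlikely if the two interfaces actually perform the same
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```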
Concerns
Internal validity
Are observed results actually caused by the
independent variables?
External validity
Can observed results be generalised to the world
outside the lab?
Reliability
Will consistent results be obtained by repeating
the experiment?
Threats to Internal Validity
Ordering effects
• People learn, and people get tired
• Randomise or counterbalance ordering
Selection effects
• Avoid pre-existing groups (unless the group is an
independent variable)
• Randomly assign users to independent variables
Experimenter bias
• Experimenters may prefer a hypothesis to be
confirmed
• Double-blind experiments are quite hard in HCI
• Control the protocol
Threats to External Validity
Population
• Draw a random sample from the real target
population
Ecological
• Make lab conditions as realistic as possible
Training
• Training should mimic how the real interface would
be encountered and learned
Task
• Tasks for testing should be based on task analysis
Threats to Reliability
Uncontrolled variation
• User differences
• Task design
• Measurement error
Solutions
• Eliminate uncontrolled variation
Select users by experience
Give consistent training
Measure dependent variables precisely
• Repetition, repetition
Many users, many trials
Standard deviation of the mean shrinks as 1/√N
(i.e. quadrupling #users makes the mean estimate
twice as accurate)
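A quick numerical check of the 1/√N claim, using simulated task times (invented distribution): the standard error of the mean roughly halves when the sample is quadrupled.

```python
import random
import statistics

random.seed(1)
# Simulated population of task times: mean 10 s, standard deviation 2 s
population = [random.gauss(10.0, 2.0) for _ in range(100_000)]

def stderr_of_mean(n):
    """Standard error of the mean for a random sample of size n."""
    sample = random.sample(population, n)
    return statistics.stdev(sample) / n ** 0.5

print(stderr_of_mean(25))   # ~2/5  = 0.40
print(stderr_of_mean(100))  # ~2/10 = 0.20 (quadrupling N halves it)
```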
Blocking
Divide samples into subsets that are more
homogeneous than the whole set
• Example: testing wear rate of different shoe sole
material
Lots of variation between feet of different people, but
the feet on the same person are more homogeneous
Apply all conditions within each block
• Test material A on one foot, material B on the other
Measure difference within block
• Wear(A) – Wear(B)
Randomise within the block to eliminate validity
threats
• Randomly put A on left or right foot
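A sketch of the shoe example with invented wear measurements: each person is one block, material A's foot is randomised within the block, and the analysis uses the within-block difference so that the large person-to-person variation cancels.

```python
import random
import statistics

# Each block = one person; both sole materials are tested within the block.
# All wear numbers below are invented for illustration.
differences = []
for person in ["p1", "p2", "p3", "p4", "p5"]:
    a_foot = random.choice(["left", "right"])  # randomise A's foot in the block
    base = random.uniform(5.0, 9.0)            # large person-level variation
    wear_a = base + random.gauss(0.0, 0.1)     # stand-in measurement, material A
    wear_b = base + 0.5 + random.gauss(0.0, 0.1)  # B wears ~0.5 units more here
    differences.append(wear_a - wear_b)        # Wear(A) - Wear(B) within block

# The person-level variation (base) cancels in the within-block difference
print(statistics.mean(differences))            # close to -0.5
```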
Between-subjects experiment
Each subject performs the experiment under
only one condition
Results are compared between different
groups
• Is mean(xi) > mean(yj)?
No transfer of learning
More users required
Variation can bias results
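A sketch of the between-groups comparison with SciPy, using invented completion times; an independent-samples t-test compares mean(xi) against mean(yj).

```python
from scipy.stats import ttest_ind

# Invented completion times (s): each user saw only one condition
group_x = [12.1, 13.4, 11.8, 14.0, 12.9, 13.7]  # users under condition X
group_y = [14.8, 15.2, 13.9, 16.1, 15.0, 14.4]  # different users, condition Y

t, p = ttest_ind(group_x, group_y)
print(f"t = {t:.2f}, p = {p:.4f}")  # small p: group means likely differ
```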
Within-subjects experiment
Each subject performs the experiment under
each condition
Results are compared within each user
• For user i compute xi – yi
• Is mean(xi – yi) > 0?
Transfer of learning possible
Less costly and less likely to suffer from user variation
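The matching within-subjects analysis, again with invented times: compute xi – yi per user and test whether the mean difference departs from zero.

```python
from scipy.stats import ttest_1samp

# Invented times (s): each user tried both conditions
x = [12.1, 13.4, 11.8, 14.0, 12.9]  # condition X, user i
y = [14.8, 15.2, 13.9, 16.1, 15.0]  # condition Y, same user i

diffs = [xi - yi for xi, yi in zip(x, y)]
t, p = ttest_1samp(diffs, 0.0)      # is mean(xi - yi) different from 0?
print(f"mean diff = {sum(diffs) / len(diffs):.2f}, p = {p:.4f}")
```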
Counterbalancing
Defeats ordering effects by varying order of conditions
systematically (not randomly)
Latin Square designs
• Randomly assign subjects to equal-size groups
• A, B, C, … are the experimental conditions
• Latin Square ensures that each condition occurs in every
position in the ordering for an equal number of users
• Balanced Latin Squares: http://www.yorku.ca/mack/RN-Counterbalancing.html
2 conditions (columns = groups, rows = order of presentation):
G1 G2
A  B
B  A

3 conditions:
G1 G2 G3
A  C  B
B  A  C
C  B  A

4 conditions:
G1 G2 G3 G4
A  D  C  B
B  A  D  C
C  B  A  D
D  C  B  A
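A sketch of one common construction for a balanced Latin square (in the spirit of the MacKenzie page linked above; verify the details against that source): the first row interleaves conditions from both ends of the sequence, each subsequent row shifts it by one, and for an odd number of conditions the reversed rows are appended to restore balance.

```python
def balanced_latin_square(n):
    """Orderings where each condition appears in each position equally
    often and (for even n) follows every other condition equally often."""
    # First row: 0, 1, n-1, 2, n-2, ... alternates ends of the sequence
    first, lo, hi = [0], 1, n - 1
    take_lo = True
    while len(first) < n:
        first.append(lo if take_lo else hi)
        lo, hi = (lo + 1, hi) if take_lo else (lo, hi - 1)
        take_lo = not take_lo
    rows = [[(c + i) % n for c in first] for i in range(n)]
    if n % 2 == 1:  # odd n: append reversed rows to restore balance
        rows += [list(reversed(r)) for r in rows]
    return rows

# One row per group; letters are the experimental conditions A, B, C, ...
for row in balanced_latin_square(4):
    print(" ".join("ABCD"[c] for c in row))
```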
Kinds of measures
Self-report
• E.g. satisfaction
Observation
• Visible vs. hidden observer
• Hawthorne effect1
Archival records
• Public vs. private
Trace
• Subjects normally unaware (e.g. testing for ‘read
wear’ on a book)
1Hawthorne effect: the alteration of behaviour by the subjects of a study due to their awareness of being observed.
Evaluation: tests
Query techniques
Interviews
Analyst questions the user on a one-to-one basis,
usually based on prepared questions
Informal, subjective and relatively cheap
Advantages
• Can be varied to suit context
• Issues can be explored more fully
• Can elicit user views and identify unanticipated
problems
Disadvantages
• Very subjective
• Time consuming
Questionnaires
Set of fixed questions given to users
Advantages
• Quick and reaches large user group
• Can be analysed more rigorously
Disadvantages
• Less flexible
• Less probing
Questionnaires
Need careful design
• What information is required?
• How are answers to be analysed?
Styles of question
• General
• Open-ended
• Scalar
• Multi-choice
• Ranked
Evaluation: tests
Physiological methods
Eye tracking
Head- or desk-mounted equipment tracks eye position
Eye movement reflects the amount of
cognitive processing a display requires;
measurements include
• fixations: eye maintains stable position. Number
and duration indicate level of difficulty with display
• saccades: rapid eye movement between points of
interest
• scan paths: moving straight to a target with a short
fixation at the target is optimal
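As an illustration of how fixations are typically extracted from raw gaze samples, here is a minimal dispersion-threshold sketch (in the spirit of the well-known I-DT algorithm; the threshold values are invented, and real eye trackers ship calibrated implementations).

```python
def dispersion(points):
    """Spread of a set of (x, y) gaze points: x-range plus y-range."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def detect_fixations(samples, max_dispersion=25.0, min_samples=6):
    """Return (start, end) index pairs where gaze stays tightly clustered.

    samples: list of (x, y) gaze positions at a fixed sampling rate.
    Everything between fixations is treated as saccades.
    """
    fixations, i = [], 0
    while i + min_samples <= len(samples):
        j = i + min_samples
        if dispersion(samples[i:j]) <= max_dispersion:
            # Grow the window while the points stay within the threshold
            while j < len(samples) and dispersion(samples[i:j + 1]) <= max_dispersion:
                j += 1
            fixations.append((i, j - 1))
            i = j
        else:
            i += 1
    return fixations
```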
Physiological measurements
Emotional response linked to physical changes
These may help determine a user’s reaction
to an interface; measurements include:
• heart activity, including blood pressure, volume and pulse.
• activity of sweat glands: Galvanic Skin Response (GSR)
• electrical activity in muscle: electromyogram (EMG)
• electrical activity in brain: electroencephalogram (EEG)
Some difficulty in interpreting these
physiological responses; more research is needed
Evaluation: tests
Applicability
Choosing an evaluation method
Question                Decision
When in process         design vs. implementation
Style of evaluation     laboratory vs. field
Level of objectivity    subjective vs. objective
Type of measures        qualitative vs. quantitative
Level of information    high level vs. low level
Level of interference   obtrusive vs. unobtrusive
Available resources     time, subjects, equipment, expertise