2. evaluation
you’ve designed it, but is it right?
3. different kinds of evaluation
endless arguments
quantitative vs. qualitative
in the lab vs. in the wild
experts vs. real users (vs UG students!)
really need to combine methods
quantitative tells you what is true; qualitative tells you why
what is appropriate and possible
4. purpose
three types of evaluation:
formative – improve a design (stage: development)
summative – say "this is good" (stage: contractual/sales)
investigative – gain understanding (stage: research / exploratory)
5. when does it end?
in a world of perpetual beta ...
real use is the ultimate evaluation
logging, bug reporting, etc.
how do people really use the product?
are some features never used?
7. what varies (and what you choose)
individuals / groups (not only UG students!)
tasks / activities
products / systems
principles / theories
prior knowledge and experience
learning and order effects
which are you trying to find out about?
which are 'noise'?
8. a little story …
BIG ACM-sponsored conference
‘good’ empirical paper
looking at collaborative support for a task X
three pieces of software:
A – domain-specific software, synchronous
B – generic software, synchronous
C – generic software, asynchronous
[2×2 grid: synchronous/asynchronous × domain-specific/generic, with A, B and C in their cells; the domain-specific asynchronous cell is empty]
9. experiment
[same 2×2 grid: A = domain-specific synchronous, B = generic synchronous, C = generic asynchronous]
reasonable nos. subjects in each condition
quality measures
significant results p<0.05
domain-specific > generic
asynchronous > synchronous
conclusion: really want async domain-specific
10. what’s wrong with that?
interaction effects
[same grid, with '?' in the empty domain-specific asynchronous cell]
the gap is interesting to study
but not necessarily where the best design ends up
more important …
if you blinked at the wrong moment …
these are NOT independent variables
three different pieces of software
like an experiment on 3 people!
B < A and B < C ... say system B was just bad
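a hypothetical sketch of the 'n = 3' problem (mine, not from the talk; all names and numbers invented): give each *package* its own random quirk, and even with plenty of subjects per condition both 'significant' effects can come from package B alone

    import random

    def run_once(subjects_per_condition=30):
        # per-package quirk, large compared with between-subject noise
        quirk = {s: random.gauss(0, 1.0) for s in "ABC"}
        means = {}
        for s in "ABC":
            scores = [random.gauss(quirk[s], 0.3)
                      for _ in range(subjects_per_condition)]
            means[s] = sum(scores) / len(scores)
        # A = domain-specific sync, B = generic sync, C = generic async:
        # 'domain-specific > generic' and 'async > sync' both reduce to
        # comparisons against package B
        return means["A"] > means["B"], means["C"] > means["B"]

    # if B happens to draw a bad quirk, both 'effects' appear at once
    print(run_once())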
11. what went wrong?
borrowed psych method
… but method embodies assumptions
single simple cause, controlled environment
interaction needs ecologically valid exp.
multiple causes, open situations
what to do?
understand assumptions and modify
13. are five users enough?
one of the myths of usability!
from a study by Nielsen and Landauer (1993)
empirical work, cost–benefit analysis and averages,
many assumptions: simplified model, iterative steps, ...
basic idea: decreasing returns
each extra user gives less new information
really ... it depends
for robust statistics – many many more
for something interesting – one may be enough
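a minimal sketch of the diminishing-returns model behind the 'five users' figure, assuming Nielsen & Landauer's oft-quoted average per-user discovery rate L ≈ 0.31 – real rates vary widely between studies

    # proportion of problems found by n users, assuming each user
    # independently finds a fraction L of them (simplified model)
    def proportion_found(n, L=0.31):
        return 1 - (1 - L) ** n

    for n in range(1, 9):
        print(n, round(proportion_found(n), 2))
    # with L = 0.31 five users find ~84%, and each extra user adds less
    # -- but with a different L, 'five' can be far too few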
14. points of comparison
measures:
average satisfaction 3.2 on a 5 point scale
time to complete task in range 13.2–27.6 seconds
good or bad?
need a point of comparison
but what?
self, similar system, created or real??
think purpose ...
what constitutes a ‘control’
think!!
15. do I need statistics?
to find some problem to fix – NO
to know how frequently it occurs,
whether most users experience it,
or if you've found most problems – YES
16. statistics
needs a course in itself!
experimental design
choosing right test
etc., etc., etc.
a few things ...
17. statistical significance
stat. sig = likelihood of seeing effect by chance
5% (p <0.05) = 1 in 20 chance
beware many tests and cherry picking!
10 tests gives roughly a 40% chance (1 − 0.95^10 ≈ 0.4) of at least one p<0.05 by chance alone (simulated below)
not necessarily large effect (i.e. ≠ important)
non-significant = not proven (NOT no effect)
may simply not be sensitive enough
e.g. too few users
to show no (small) effect need other methods
find out about confidence intervals!
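a quick check of the cherry-picking arithmetic (my sketch, not from the slides): under the null hypothesis a p-value is uniform on [0, 1], so the '10 tests' risk can be simulated directly

    import random

    # chance of at least one 'significant' result from n_tests tests
    # run on pure noise
    def false_positive_rate(n_tests=10, alpha=0.05, trials=100_000):
        hits = sum(
            any(random.random() < alpha for _ in range(n_tests))
            for _ in range(trials)
        )
        return hits / trials

    print(false_positive_rate())    # ~0.40 empirically
    print(1 - (1 - 0.05) ** 10)     # 0.401... analytically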
18. statistical power
how likely effect will show up in experiment
more users means more ‘power’
2× sensitivity needs 4× the number of users (see sketch below)
manipulate it!
more users (but usually many more)
within subject/group (‘cancels’ individual diffs.)
choice of task (particularly good/bad)
add distracter task
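why 2× sensitivity needs 4× users – the standard error of a mean shrinks as 1/√n, so halving the smallest detectable effect means quadrupling the sample (illustrative values only)

    import math

    def standard_error(sigma, n):
        return sigma / math.sqrt(n)

    print(standard_error(1.0, 20))   # baseline group size
    print(standard_error(1.0, 80))   # 4x users -> half the standard error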
20. types of knowledge
descriptive
explaining what happened
predictive
saying what will happen
cause ⇒ effect
where science often ends
synthetic
working out what to do to make what you want happen
effect ⇒ cause
design and engineering
21. generalisation?
can we ever generalise?
every situation is unique, but ...
... to use past experience is to generalise
generalisation ≠ abstraction
cases, descriptive frameworks, etc.
data ≠ generalisation
interpolation – maybe
extrapolation??
22. generalisation ...
never comes (solely) from data
always comes from the head
requires understanding
23. mechanism
[diagram: reduction / reconstruction, mechanism marked '?']
reduction
– formal hypothesis testing
+ may be qualitative too
– more scientific precision
wholistic / analytic
– field studies, ethnographies
+ 'end to end' experiments
– more ecological validity
25. validating work
your work (a singularity?)
justification:
– expert opinion
– previous research
– new experiments
evaluation – sampling different people, different situations:
– experiments
– user studies
– peer review
26. generative artefacts
artefact – toolkits, guidelines, methodologies
generating devices, interfaces for people, situations
evaluation singularity? too many to sample
justification:
– expert opinion
– previous research
– new experiments
evaluation:
– experiments
– user studies
– peer review
plus ... different designers, different briefs
(pure) evaluation of generative artefacts
is methodologically unsound
27. validating work
your work
justification:
– expert opinion
– previous research
– new experiments
evaluation:
– experiments
– user studies
– peer review
28. justification vs. validation
• different disciplines
– mathematics: proof = justification
– medicine: drug trials = evaluation
• combine them:
– look for weakness in justification
– focus evaluation there
29. example – scroll arrows ...
Xerox STAR – first commercial GUI
precursor of Mac, Windows, ...
principled design decisions
which direction for scroll arrows?
not obvious: moving document or handle?
=> do a user study!
gap in justification => evaluation
unfortunately ...
Apple got the wrong design