2. evaluation
you’ve designed it, but is it right?
3. different kinds of evaluation
endless arguments
quantitative vs. qualitative
in the lab vs. in the wild
experts vs. real users (vs UG students!)
really need to combine methods
quantitative tells you what is true; qualitative tells you why
what is appropriate and possible
4. purpose
three types of evaluation:
formative – improve a design (stage: development)
summative – say "this is good" (stage: contractual/sales)
investigative – gain understanding (stage: research / exploratory)
5. when does it end?
in a world of perpetual beta ...
real use is the ultimate evaluation
logging, bug reporting, etc.
how do people really use the product?
are some features never used?
7. what varies (and what you choose)
individuals / groups (not only UG students!)
tasks / activities
products / systems
principles / theories
prior knowledge and experience
learning and order effects
which are you trying to find out about?
which are 'noise'?
8. a little story …
BIG ACM-sponsored conference
‘good’ empirical paper
looking at collaborative support for a task X
three pieces of software:
A – domain-specific software, synchronous
B – generic software, synchronous
C – generic software, asynchronous
[2×2 grid: synchronous/asynchronous × domain-specific/generic, with A, B and C in their cells; the domain-specific asynchronous cell is empty]
9. experiment
[same 2×2 grid: A = domain-specific synchronous, B = generic synchronous, C = generic asynchronous]
reasonable nos. subjects in each condition
quality measures
significant results p<0.05
domain-specific > generic
asynchronous > synchronous
conclusion: really want async domain-specific
10. what’s wrong with that?
interaction effects
[same grid, with '?' in the empty domain-specific asynchronous cell]
the gap is interesting to study
but not necessarily where the best design ends up
more important …
if you blinked at the wrong moment …
these are NOT independent variables
three different pieces of software
like an experiment on 3 people!
B < A and B < C ... say system B was just bad
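a hypothetical sketch of the 'n = 3' problem (mine, not from the talk; all names and numbers invented): give each *package* its own random quirk, and even with plenty of subjects per condition both 'significant' effects can come from package B alone

    import random

    def run_once(subjects_per_condition=30):
        # per-package quirk, large compared with between-subject noise
        quirk = {s: random.gauss(0, 1.0) for s in "ABC"}
        means = {}
        for s in "ABC":
            scores = [random.gauss(quirk[s], 0.3)
                      for _ in range(subjects_per_condition)]
            means[s] = sum(scores) / len(scores)
        # A = domain-specific sync, B = generic sync, C = generic async:
        # 'domain-specific > generic' and 'async > sync' both reduce to
        # comparisons against package B
        return means["A"] > means["B"], means["C"] > means["B"]

    # if B happens to draw a bad quirk, both 'effects' appear at once
    print(run_once())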
11. what went wrong?
borrowed psych method
… but method embodies assumptions
single simple cause, controlled environment
interaction needs ecologically valid exp.
multiple causes, open situations
what to do?
understand assumptions and modify
13. are five users enough?
one of the myths of usability!
from a study by Nielsen and Landauer (1993)
empirical work, cost–benefit analysis and averages,
many assumptions: simplified model, iterative steps, ...
basic idea: decreasing returns
each extra user gives less new information
really ... it depends
for robust statistics – many many more
for something interesting – one may be enough
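a minimal sketch of the diminishing-returns model behind the 'five users' figure, assuming Nielsen & Landauer's oft-quoted average per-user discovery rate L ≈ 0.31 – real rates vary widely between studies

    # proportion of problems found by n users, assuming each user
    # independently finds a fraction L of them (simplified model)
    def proportion_found(n, L=0.31):
        return 1 - (1 - L) ** n

    for n in range(1, 9):
        print(n, round(proportion_found(n), 2))
    # with L = 0.31 five users find ~84%, and each extra user adds less
    # -- but with a different L, 'five' can be far too few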
14. points of comparison
measures:
average satisfaction 3.2 on a 5 point scale
time to complete task in range 13.2–27.6 seconds
good or bad?
need a point of comparison
but what?
self, similar system, created or real??
think purpose ...
what constitutes a ‘control’
think!!
15. do I need statistics?
to find some problem to fix – NO
to know how frequently it occurs,
whether most users experience it,
or if you've found most problems – YES
16. statistics
needs a course in itself!
experimental design
choosing right test
etc., etc., etc.
a few things ...
17. statistical significance
stat. sig = likelihood of seeing effect by chance
5% (p <0.05) = 1 in 20 chance
beware many tests and cherry picking!
10 tests gives roughly a 40% chance (1 − 0.95^10 ≈ 0.4) of at least one p<0.05 by chance alone (simulated below)
not necessarily large effect (i.e. ≠ important)
non-significant = not proven (NOT no effect)
may simply not be sensitive enough
e.g. too few users
to show no (small) effect need other methods
find out about confidence intervals!
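a quick check of the cherry-picking arithmetic (my sketch, not from the slides): under the null hypothesis a p-value is uniform on [0, 1], so the '10 tests' risk can be simulated directly

    import random

    # chance of at least one 'significant' result from n_tests tests
    # run on pure noise
    def false_positive_rate(n_tests=10, alpha=0.05, trials=100_000):
        hits = sum(
            any(random.random() < alpha for _ in range(n_tests))
            for _ in range(trials)
        )
        return hits / trials

    print(false_positive_rate())    # ~0.40 empirically
    print(1 - (1 - 0.05) ** 10)     # 0.401... analytically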
18. statistical power
how likely effect will show up in experiment
more users means more ‘power’
2× sensitivity needs 4× the number of users (see sketch below)
manipulate it!
more users (but usually many more)
within subject/group (‘cancels’ individual diffs.)
choice of task (particularly good/bad)
add distracter task
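why 2× sensitivity needs 4× users – the standard error of a mean shrinks as 1/√n, so halving the smallest detectable effect means quadrupling the sample (illustrative values only)

    import math

    def standard_error(sigma, n):
        return sigma / math.sqrt(n)

    print(standard_error(1.0, 20))   # baseline group size
    print(standard_error(1.0, 80))   # 4x users -> half the standard error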
20. types of knowledge
descriptive
explaining what happened
predictive
saying what will happen
cause ⇒ effect
where science often ends
synthetic
working out what to do to make what you want happen
effect ⇒ cause
design and engineering
21. generalisation?
can we ever generalise?
every situation is unique, but ...
... to use past experience is to generalise
generalisation ≠ abstraction
cases, descriptive frameworks, etc.
data ≠ generalisation
interpolation – maybe
extrapolation??
22. generalisation ...
never comes (solely) from data
always comes from the head
requires understanding
23. mechanism
[diagram: reduction / reconstruction, mechanism marked '?']
reduction
– formal hypothesis testing
+ may be qualitative too
– more scientific precision
wholistic / analytic
– field studies, ethnographies
+ 'end to end' experiments
– more ecological validity
25. validating work
your work (a singularity?)
justification:
– expert opinion
– previous research
– new experiments
evaluation – sampling different people, different situations:
– experiments
– user studies
– peer review
26. generative artefacts
artefact – toolkits, guidelines, methodologies
generating devices, interfaces for people, situations
evaluation singularity? too many to sample
justification:
– expert opinion
– previous research
– new experiments
evaluation:
– experiments
– user studies
– peer review
plus ... different designers, different briefs
(pure) evaluation of generative artefacts
is methodologically unsound
27. validating work
your work
justification:
– expert opinion
– previous research
– new experiments
evaluation:
– experiments
– user studies
– peer review
28. justification vs. validation
• different disciplines
– mathematics: proof = justification
– medicine: drug trials = evaluation
• combine them:
– look for weakness in justification
– focus evaluation there
29. example – scroll arrows ...
Xerox STAR – first commercial GUI
precursor of Mac, Windows, ...
principled design decisions
which direction for scroll arrows?
not obvious: moving document or handle?
=> do a user study!
gap in justification => evaluation
unfortunately ...
Apple got the wrong design