evaluation, validation and
empirical methods

Alan Dix
http://www.alandix.com/
evaluation

   you’ve designed it, but is it right?
different kinds of evaluation

endless arguments
  quantitative vs. qualitative
  in the lab vs. in the wild
  experts vs. real users (vs UG students!)


really need to
  combine methods
         quantitative – what is true   &   qualitative – why
  what is appropriate and possible
purpose


three types of evaluation

                       purpose               stage
   formative           improve a design      development
   summative           say “this is good”    contractual/sales
   investigative
    / exploratory      gain understanding    research
when does it end?

in a world of perpetual beta ...

      real use is the ultimate evaluation

logging, bug reporting, etc.
how do people really use the product?
are some features never used?
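
a concrete way in – a minimal Python sketch of mining usage logs for never-used features (the log format and feature names are hypothetical, just to show the shape of the analysis):

# minimal sketch: "are some features never used?" from product logging
# the feature list and (user, feature) log format are invented for illustration
from collections import Counter

ALL_FEATURES = {"search", "export", "share", "dark_mode"}

def feature_usage(events):
    """events: iterable of (user_id, feature_name) pairs from logging."""
    counts = Counter(feature for _, feature in events)
    never_used = ALL_FEATURES - set(counts)
    return counts, never_used

counts, never_used = feature_usage([
    ("u1", "search"), ("u2", "search"), ("u1", "export"),
])
print("usage counts:", counts)      # how do people really use the product?
print("never used:", never_used)    # {'share', 'dark_mode'}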
studies and experiments
what varies (and what you choose)

individuals / groups (not only UG students!)
tasks / activities
products / systems
principles / theories
prior knowledge and experience
learning and order effects (a counterbalancing sketch follows below)

      which are you trying to find out about?
      which are ‘noise’?
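
one common way to turn learning and order effects into ‘noise’ rather than a confound is to counterbalance task order across participants; a minimal sketch (my illustration, not from the slides) using a cyclic Latin square:

# counterbalancing sketch: each task appears once in every serial position,
# so practice/learning effects are spread evenly across tasks
def latin_square(tasks):
    n = len(tasks)
    return [[tasks[(i + j) % n] for j in range(n)] for i in range(n)]

for group, order in enumerate(latin_square(["A", "B", "C", "D"]), start=1):
    print(f"participant group {group}: {order}")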
a little story …

BIG ACM sponsored conference
‘good’ empirical paper
looking at collaborative support for a task X
three pieces of software:
  A – domain specific software, synchronous
  B – generic software, synchronous
  C – generic software, asynchronous

                       sync       async
   domain specific      A
   generic              B           C
experiment

                       sync       async
   domain specific      A
   generic              B           C

reasonable numbers of subjects in each condition
quality measures

significant results p<0.05
   domain spec. > generic
   asynchronous > synchronous

conclusion: really want async domain specific
what’s wrong with that?

                       sync       async
   domain specific      A           ?
   generic              B           C

interaction effects
   the gap – the (domain specific, asynchronous) cell – is interesting to study
   but would not necessarily end up best (toy numbers below)

more important …
   if you blinked at the wrong moment …

NOT independent variables
   three different pieces of software – like an experiment on 3 people!
   suppose system B was just bad: that alone gives B < A and B < C
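
a toy calculation (made-up scores, mine rather than from the paper in the story) showing why the conclusion does not follow – both main effects hold in the three observed cells, yet the data says nothing about the fourth:

# hypothetical cell means (quality scores); only three cells were observed
A = 7.0   # domain specific, synchronous
B = 4.0   # generic,         synchronous
C = 6.0   # generic,         asynchronous

# "async domain specific must be best" silently assumes additivity (no interaction)
additive_guess = B + (A - B) + (C - B)
print("additive prediction for (domain specific, async):", additive_guess)  # 9.0

# with an interaction effect the unobserved cell is unconstrained by the data:
# it could just as easily score 3.0, worse than everything actually measured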
what went wrong?

borrowed psych method
    … but method embodies assumptions
    single simple cause, controlled environment


interaction needs ecologically valid exp.
    multiple causes, open situations


what to do?
    understand assumptions and modify
numbers and statistics
are five users enough?

one of the myths of usability!
    from a study by Nielsen and Landauer (1993)
        empirical work, cost–benefit analysis and averages,
        many assumptions: simplified model, iterative steps, ...

basic idea: decreasing returns
    each extra user gives less new information (see the sketch below)

really ... it depends
    for robust statistics – many many more
    for something interesting – one may be enough
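
the decreasing-returns idea comes from a simple model: if each user independently finds a given problem with probability L, then n users find a proportion 1 − (1 − L)^n of the problems; a quick sketch (L ≈ 0.31 is the average rate Nielsen reports, but it varies a lot between systems, which is where the ‘myth’ bites):

# decreasing returns behind "five users are enough" (Nielsen & Landauer style model)
def proportion_found(n_users, L):
    """expected proportion of problems found by n users, detection rate L per user"""
    return 1 - (1 - L) ** n_users

for L in (0.31, 0.10):   # 0.31 ~ reported average; 0.10 ~ a harder system (assumed)
    summary = ", ".join(f"{n} users ~{proportion_found(n, L):.0%}" for n in (1, 3, 5, 10))
    print(f"L = {L}: {summary}")
# with L = 0.31 five users find ~84% of problems; with L = 0.10 only ~41%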
points of comparison
measures:
  average satisfaction 3.2 on a 5 point scale
  time to complete task in range 13.2–27.6 seconds
  good or bad?
need a point of comparison
  but what?
  self, similar system, created or real??
  think purpose ...
what constitutes a ‘control’
  think!!
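
a small, entirely made-up illustration of giving a measure a point of comparison – the same task timed on a baseline system and compared, rather than judged in isolation (the numbers and the choice of a two-sample t-test are mine, only to show the shape of the comparison):

# comparison sketch: raw task times mean little without a baseline
# all numbers are invented for illustration
from scipy import stats

new_system = [13.2, 18.4, 21.0, 27.6, 16.8, 19.5]   # seconds on the new system
baseline   = [22.1, 25.3, 30.2, 28.7, 24.9, 27.0]   # same task, comparison system

result = stats.ttest_ind(new_system, baseline)
print(f"new mean {sum(new_system)/len(new_system):.1f}s vs "
      f"baseline mean {sum(baseline)/len(baseline):.1f}s, p = {result.pvalue:.3f}")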
do I need statistics?


finding some problems to fix – NO
to know any of the following – YES
   how frequently a problem occurs
   whether most users experience it
   if you’ve found most problems
statistics


need a course in itself!
             experimental design
             choosing right test
             etc., etc., etc.


a few things ...
statistical significance

stat. sig = likelihood of seeing effect by chance
           5% (p <0.05) = 1 in 20 chance
           beware many tests and cherry picking!
           10 tests means about a 40% chance of some p<0.05 by chance alone (see below)
  not necessarily large effect (i.e. ≠ important)

non-significant = not proven (NOT no effect)
  may simply not be sensitive enough
  e.g. too few users
  to show no (small) effect need other methods
           find out about confidence intervals!
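
the cherry-picking warning is easy to check: with k independent tests at the 5% level, the chance of at least one spurious ‘significant’ result is 1 − 0.95^k:

# family-wise false-positive risk when running several tests at p < 0.05
for k in (1, 5, 10, 20):
    print(f"{k:2d} independent tests: P(some p<0.05 by chance) = {1 - 0.95**k:.2f}")
# 10 tests already gives ~0.40 -- hence the warning about many tests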
statistical power

how likely effect will show up in experiment
   more users means more ‘power’
             2x sensitivity needs 4x the number of users (sketched below)

manipulate it!
   more users (but usually many more)
   within subject/group (‘cancels’ individual diffs.)
   choice of task (particularly good/bad)
   add distracter task
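
the ‘2× sensitivity needs 4× users’ rule follows from the required sample size scaling as 1/d² for a standardised effect size d; a minimal sketch using the usual normal-approximation formula (α = 0.05 two-sided and 80% power are my assumed conventions):

# why halving the detectable effect size roughly quadruples the users needed
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.8):
    """approximate participants per group for a two-group comparison"""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided criterion
    z_beta = norm.ppf(power)            # desired power
    return 2 * ((z_alpha + z_beta) / d) ** 2

for d in (0.8, 0.4, 0.2):               # each step halves the effect to detect
    print(f"effect size d = {d}: ~{n_per_group(d):.0f} users per group")
# ~25, ~98, ~392 -- each halving of d costs roughly 4x the participants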
from data to knowledge
types of knowledge

descriptive
   explaining what happened

predictive
   saying what will happen
                    cause ⇒ effect
   where science often ends

synthetic
   working out what to do to make what you want happen
                    effect ⇒ cause
   design and engineering
generalisation?

can we ever generalise?
every situation is unique, but ...
     ... to use past experience is to generalise

generalisation ≠ abstraction
           cases, descriptive frameworks, etc.

data ≠ generalisation
           interpolation – maybe
           extrapolation??
generalisation ...

    never comes (solely) from data

     always comes from the head

        requires understanding
mechanism

reduction / reconstruction
   formal hypothesis testing
   may be qualitative too
   more scientific precision

wholistic / analytic
   field studies, ethnographies
   ‘end to end’ experiments
   more ecological validity
from evaluation to validation
validating work

                            your work

   justification                        evaluation
      – expert opinion                     – experiments
      – previous research                  – user studies
      – new experiments                    – peer review

   (evaluation samples across different people and different situations;
    the work itself is a singularity?)
generative artefacts

   toolkits, devices, interfaces,
   guidelines, methodologies

   justification                        evaluation
      – expert opinion                     – experiments
      – previous research                  – user studies
      – new experiments                    – peer review

   (to evaluate the artefact you would need to sample not only different
    people and situations, but also different designers and different
    briefs – too many to sample)

      (pure) evaluation of generative artefacts
            is methodologically unsound
validating work

                            your work

   justification                        evaluation
      – expert opinion                     – experiments
      – previous research                  – user studies
      – new experiments                    – peer review
justification vs. validation


    justification                          evaluation



 • different disciplines
    – mathematics: proof = justification
    – medicine: drug trials = evaluation

 • combine them:
    – look for weakness in justification
    – focus evaluation there
example – scroll arrows ...
Xerox STAR – first commercial GUI
      precursor of Mac, Windows, ...
      principled design decisions

which direction for scroll arrows?
      not obvious: moving document or handle?
=> do a user study!
      gap in justification => evaluation
unfortunately ...
      Apple got the wrong design
