evaluation
you’ve designed it, but is it right?
different kinds of evaluation
• endless arguments:
   – quantitative vs. qualitative
   – in the lab vs. in the wild
   – experts vs. real users (vs. UG students!)
• really need to combine methods:
   – quantitative – what is true
   – qualitative – why
• what is appropriate and possible
purpose
• three types of evaluation:

   type            purpose               stage
   formative       improve a design      development
   summative       say “this is good”    contractual / sales
   investigative   gain understanding    research / exploratory
when does it end?
• in a world of perpetual beta ...
• real use is the ultimate evaluation
   – logging, bug reporting, etc.
   – how do people really use the product?
   – are some features never used?
what varies (and what you choose)
• individuals / groups (not only UG students!)
• tasks / activities
• products / systems
• principles / theories
• prior knowledge and experience
• learning and order effects
• which are you trying to find out about?
   which are ‘noise’?
a little story ...
• BIG ACM sponsored conference
• ‘good’ empirical paper
• looking at collaborative support for a task X
• three pieces of software:
   A – domain specific software, synchronous
   B – generic software, synchronous
   C – generic software, asynchronous

                  sync    async
   domain spec.    A
   generic         B        C
the experiment
• reasonable nos. of subjects in each condition
• quality measures
• significant results (p < 0.05):
   – domain spec. > generic
   – asynchronous > synchronous
• conclusion: really want async domain specific

                  sync    async
   domain spec.    A
   generic         B        C
what’s wrong with that?
• interaction effects
   – the gap (domain spec. + async) is interesting to study
   – not necessarily what ends up best
• more important ... if you blinked at the wrong moment ...
   – NOT independent variables:
      three different pieces of software
   – like an experiment on 3 people!
   – B < A and B < C
      say system B was just bad

                  sync    async
   domain spec.    A        ?
   generic         B        C
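The confound can be made concrete with a tiny numeric sketch (the scores below are invented for illustration, not taken from the paper):

```python
# Sketch of the confound: B < A and B < C need not mean
# "domain specific beats generic" or "async beats sync" --
# both apparent effects can follow from system B just being bad.

scores = {"A": 7.0,   # domain specific, synchronous
          "B": 4.0,   # generic, synchronous -- simply a bad program
          "C": 7.0}   # generic, asynchronous

domain_spec = scores["A"]
generic = (scores["B"] + scores["C"]) / 2
sync = (scores["A"] + scores["B"]) / 2
asynchronous = scores["C"]

print(domain_spec > generic)   # "domain specific wins" -> True
print(asynchronous > sync)     # "async wins" -> True
# Both 'effects' are produced entirely by B's low score:
# A and C, the only systems differing on each single factor, tie.
```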
what went wrong?
• borrowed psych method ... but method embodies assumptions
   – single simple cause, controlled environment
• interaction needs ecologically valid experiments
   – multiple causes, open situations
• what to do?
   – understand the assumptions and modify
are five users enough?
• one of the myths of usability!
• from a study by Nielsen and Landauer (1993)
   – empirical work, cost–benefit analysis and averages
   – many assumptions: simplified model, iterative steps, ...
• basic idea: decreasing returns
   – each extra user gives less new information
• really ... it depends
   – for robust statistics – many, many more
   – for something interesting – one may be enough
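The Nielsen and Landauer result rests on a simple discovery model. A minimal sketch, using the often-quoted average per-user discovery rate L ≈ 0.31 from their data (individual projects varied widely, which is exactly the slide's caveat):

```python
# Nielsen & Landauer (1993) problem-discovery model:
#   proportion of problems found by n users = 1 - (1 - L)**n
# L = 0.31 is the average discovery rate reported across their
# projects (an assumption here -- real projects differ a lot).

def proportion_found(n_users, discovery_rate=0.31):
    """Expected fraction of usability problems found by n users."""
    return 1 - (1 - discovery_rate) ** n_users

for n in (1, 3, 5, 10, 15):
    print(n, round(proportion_found(n), 2))
# With L = 0.31, five users find roughly 84% of problems --
# hence the "five users" rule of thumb, and its hidden assumptions.
```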
points of comparison
• measures:
   – average satisfaction 3.2 on a 5 point scale
   – time to complete task in range 13.2–27.6 seconds
   – good or bad?
• need a point of comparison
   – but what? self, similar system, created or real?
   – think purpose ...
• what constitutes a ‘control’ – think!!
do I need statistics?
• finding some problem to fix – NO
• to know how frequently it occurs – YES
• to know whether most users experience it – YES
• to know if you’ve found most problems – YES
statistics
• need a course in itself!
   – experimental design
   – choosing the right test
   – etc., etc., etc.
• a few things ...
statistical significance
• stat. sig. = likelihood of seeing the effect by chance
   – 5% (p < 0.05) = 1 in 20 chance
• beware many tests and cherry picking!
   – 10 tests means roughly a 40% (nearly 50:50) chance
     of seeing p < 0.05 somewhere
• not necessarily a large effect (i.e. ≠ important)
• non-significant = not proven (NOT no effect)
   – may simply not be sensitive enough
     e.g. too few users to show a (small) effect
   – need other methods
• find out about confidence intervals!
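The cherry-picking risk is easy to quantify: under k independent tests of true null hypotheses at α = 0.05, the chance of at least one spurious ‘significant’ result is 1 − 0.95^k. A quick sketch:

```python
# Family-wise false-positive risk for k independent tests:
#   P(at least one "significant" result by pure chance)
#     = 1 - (1 - alpha)**k

def chance_of_false_positive(k_tests, alpha=0.05):
    """Probability that at least one of k independent null tests
    reaches p < alpha purely by chance."""
    return 1 - (1 - alpha) ** k_tests

for k in (1, 5, 10, 20):
    print(k, round(chance_of_false_positive(k), 2))
# For 10 tests the risk is about 0.40 -- close to even odds of
# finding something "significant" even when nothing is there.
```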
statistical power
• how likely the effect will show up in the experiment
   – more users means more ‘power’
   – 2× sensitivity needs 4× the number of users
• manipulate it!
   – more users (but usually many more)
   – within subject/group (‘cancels’ individual diffs.)
   – choice of task (particularly good/bad)
   – add a distracter task
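The ‘2× sensitivity needs 4× users’ rule follows from the standard error of a mean shrinking with the square root of the sample size (a textbook relation, not derived on the slide):

```python
# Standard error of the mean: SE = sigma / sqrt(n).
# Quadrupling n only halves SE, so detecting an effect half the
# size needs four times as many users.
import math

def standard_error(sigma, n):
    """Standard error of the mean of n independent observations."""
    return sigma / math.sqrt(n)

se_20 = standard_error(1.0, 20)
se_80 = standard_error(1.0, 80)   # 4x the users ...
print(se_20 / se_80)              # ... exactly halves the SE
```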
types of knowledge
• descriptive
   – explaining what happened
• predictive
   – saying what will happen
   – cause ⇒ effect
   – where science often ends
• synthetic
   – working out what to do to make what you want happen
   – effect ⇒ cause
   – design and engineering
generalisation?
• can we ever generalise?
   – every situation is unique, but ...
   – ... to use past experience is to generalise
• generalisation ≠ abstraction
   – cases, descriptive frameworks, etc.
• data ≠ generalisation
   – interpolation – maybe
   – extrapolation??
generalisation ...
• never comes (solely) from data
• always comes from the head
• requires understanding
mechanism?
• reductionist
   – formal hypothesis testing (may be qualitative too)
   – more scientific precision
• wholistic
   – field studies, ethnographies
   – ‘end to end’ experiments
   – more ecological validity
validating your work
• justification
   – expert opinion
   – previous research
   – peer review
• evaluation
   – experiments
   – user studies
   – new experiments
• evaluation singularity?
   – sampling: different people, different situations
generative artefacts
• artefacts: toolkits, devices, interfaces,
   guidelines, methodologies
• evaluation singularity: too many people, situations
• justification plus ...
   – expert opinion
   – previous research
   – peer review
• evaluation
   – experiments
   – user studies
   – new experiments
• sampling: different designers, different briefs
• (pure) evaluation of generative artefacts is
   methodologically unsound
validating your work
• justification
   – expert opinion
   – previous research
   – peer review
• evaluation
   – experiments
   – user studies
   – new experiments
justification vs. validation
• different disciplines
   – mathematics: proof = justification
   – medicine: drug trials = evaluation
• combine them:
   – look for weakness in justification
   – focus evaluation there
example – scroll arrows ...
• Xerox STAR – first commercial GUI
   – precursor of Mac, Windows, ...
   – principled design decisions
• which direction for scroll arrows?
   – not obvious: moving document or handle?
   – => do a user study!
   – gap in justification => evaluation
• unfortunately ... Apple got the wrong design