Slides for HCICourse.com

Published in: Education


  1. evaluation, validation and empirical methods. Alan Dix, http://www.alandix.com/
  2. evaluation: you’ve designed it, but is it right?
  3. different kinds of evaluation. endless arguments: quantitative vs. qualitative, in the lab vs. in the wild, experts vs. real users (vs. UG students!). really you need to combine methods: quantitative tells you what is true, qualitative tells you why. choose what is appropriate and possible.
  4. purpose. three types of evaluation (purpose / stage): formative – improve a design (development); summative – say “this is good” (contractual/sales); investigative – gain understanding (research / exploratory).
  5. when does it end? in a world of perpetual beta ... real use is the ultimate evaluation: logging, bug reporting, etc. how do people really use the product? are some features never used?
  6. studies and experiments
  7. what varies (and what you choose): individuals / groups (not only UG students!); tasks / activities; products / systems; principles / theories; prior knowledge and experience; learning and order effects. which are you trying to find out about? which are ‘noise’?
  8. a little story ... a BIG ACM-sponsored conference, a ‘good’ empirical paper looking at collaborative support for a task X. three pieces of software: A – domain specific, synchronous; B – generic, synchronous; C – generic, asynchronous. [slide diagram: 2×2 grid, domain specific vs. generic against synchronous vs. asynchronous, with A in domain/sync, B in generic/sync, C in generic/async]
  9. the experiment: reasonable numbers of subjects in each condition; quality measures; significant results (p<0.05): domain specific > generic, and asynchronous > synchronous. conclusion: you really want an asynchronous, domain-specific system.
  10. what’s wrong with that? interaction effects: the empty cell (domain specific + asynchronous) is interesting to study, but not necessarily where the best design ends up. more important ... if you blinked at the wrong moment ... these are NOT independent variables – three different pieces of software is like an experiment on 3 people! B < A and B < C may simply mean system B was just bad.
  11. what went wrong? a borrowed psych method ... but the method embodies assumptions: a single simple cause, a controlled environment. interaction needs ecologically valid experiments: multiple causes, open situations. what to do? understand the assumptions and modify.
  12. numbers and statistics
  13. are five users enough? one of the myths of usability! from a study by Nielsen and Landauer (1993): empirical work, cost–benefit analysis and averages, with many assumptions (simplified model, iterative steps, ...). basic idea: decreasing returns – each extra user gives less new information. really ... it depends: for robust statistics – many, many more; for something interesting – one may be enough.
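The Nielsen and Landauer result quoted above is usually expressed as a problem-discovery model: the fraction of usability problems found by n users is 1 − (1 − L)^n, where L is the chance a single user hits a given problem. A minimal Python sketch, assuming their published average of L ≈ 0.31 (which varies a great deal between projects):

```python
def proportion_found(n_users, L=0.31):
    """Expected fraction of usability problems uncovered by n users.

    Nielsen & Landauer (1993) discovery model: 1 - (1 - L)^n,
    with L the per-user discovery rate (~0.31 on average in their data,
    but highly project-dependent).
    """
    return 1 - (1 - L) ** n_users

for n in (1, 3, 5, 10):
    print(f"{n:2d} users -> {proportion_found(n):.0%} of problems")
```

With L = 0.31 five users find roughly 85% of problems, which is where the “five users” rule of thumb comes from; but with a harder-to-hit problem set, say L = 0.1, the same five users find under half.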
  14. points of comparison. measures: average satisfaction of 3.2 on a 5 point scale; time to complete the task in the range 13.2–27.6 seconds. good or bad? you need a point of comparison – but what? self, a similar system, a created or a real one?? think purpose ... what constitutes a ‘control’? think!!
  15. do I need statistics? to find some problem to fix – NO. to know how frequently it occurs, whether most users experience it, or whether you’ve found most problems – YES.
  16. statistics: needs a course in itself! experimental design, choosing the right test, etc., etc., etc. ... a few things here ...
  17. statistical significance. stat. sig. = the likelihood of seeing the effect by chance: 5% (p<0.05) = a 1 in 20 chance. beware many tests and cherry picking! 10 tests mean roughly a 50:50 chance of seeing p<0.05 somewhere. significant is not necessarily a large effect (i.e. ≠ important). non-significant = not proven (NOT “no effect”): the experiment may simply not be sensitive enough, e.g. too few users to show a (small) effect – you need other methods. find out about confidence intervals!
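The multiple-testing warning can be checked directly: under the null hypothesis each test has a 5% false-positive rate, so the chance that at least one of k independent tests comes out “significant” is 1 − 0.95^k. A minimal sketch (the slide’s “50:50” is a round figure; the exact rate for 10 independent tests is about 40%, and it passes 50% at 14 tests):

```python
def familywise_error(k, alpha=0.05):
    """Chance of at least one false positive across k independent
    tests, each run at significance level alpha, when no real
    effect exists: 1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** k

print(f"10 tests: {familywise_error(10):.0%}")  # about 40%
print(f"14 tests: {familywise_error(14):.0%}")  # just past 50:50
```

A common crude correction is Bonferroni: run each of the k tests at alpha/k instead of alpha.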
  18. statistical power: how likely the effect is to show up in the experiment. more users means more ‘power’, but 2× sensitivity needs 4× the number of users. manipulate it! more users (but usually many more); within-subject/group designs (‘cancel’ individual differences); choice of task (particularly good/bad ones); add a distracter task.
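The “2× sensitivity needs 4× users” claim follows from the standard error shrinking as 1/√n: halving the detectable effect size quadruples the required sample. A rough two-sample sketch, where the z ≈ 2.8 constant approximates 80% power at α = 0.05 (z_α/2 + z_β ≈ 1.96 + 0.84); this is illustrative, not a substitute for a proper power analysis:

```python
import math

def required_n(effect, sd, z=2.8):
    """Approximate per-group sample size for a two-sample comparison:
    n = 2 * (z * sd / effect)^2.  Halving `effect` quadruples n."""
    return math.ceil(2 * (z * sd / effect) ** 2)

full = required_n(effect=1.0, sd=2.0)   # detect a 1-unit difference
half = required_n(effect=0.5, sd=2.0)   # half the effect: 2x sensitivity
print(full, half)  # the second is about 4x the first
```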
  19. from data to knowledge
  20. types of knowledge: descriptive – explaining what happened; predictive – saying what will happen (cause ⇒ effect), where science often ends; synthetic – working out what to do to make what you want happen (effect ⇒ cause): design and engineering.
  21. generalisation? can we ever generalise? every situation is unique, but ... to use past experience is to generalise. generalisation ≠ abstraction: cases, descriptive frameworks, etc. data ≠ generalisation: interpolation – maybe; extrapolation??
  22. generalisation ... never comes (solely) from data; always comes from the head; requires understanding.
  23. mechanism? reductionist – formal hypothesis testing (may be qualitative too): more scientific precision. wholistic / analytic – field studies, ethnographies, ‘end to end’ experiments: more ecological validity.
  24. from evaluation to validation
  25. validating your work. justification and evaluation: expert opinion, experiments, previous research, user studies, new experiments, peer review. but any study samples only some people and some situations – the evaluation ‘singularity’?
  26. generative artefacts: toolkits, guidelines, methodologies – artefacts that generate other artefacts (devices, interfaces). justification and evaluation as before (expert opinion, experiments, previous research, user studies, new experiments, peer review), plus sampling across different designers and different briefs as well as different people and situations: the evaluation singularity many times over. (pure) evaluation of generative artefacts is methodologically unsound.
  27. validating your work: justification – expert opinion, previous research; evaluation – experiments, user studies, new experiments; and over both – peer review.
  28. justification vs. validation: different disciplines – mathematics: proof = justification; medicine: drug trials = evaluation. combine them: look for weakness in the justification, and focus evaluation there.
  29. example – scroll arrows ... Xerox STAR – the first commercial GUI, precursor of Mac, Windows, ...; principled design decisions. which direction for the scroll arrows? not obvious: do they move the document or the handle? => do a user study! a gap in justification => evaluation. unfortunately ... Apple got the wrong design