3. What is a usability metric?
• The measurement of users’
relative performance on a
given set of test tasks. The
most basic measures are
based on the definition of
usability as a quality metric:
success rate, error rate,
and users’ subjective
satisfaction.
4. Benefits of UT metrics
• Track progress between releases. You cannot fine-tune your
methodology unless you know how well you're doing.
• Assess your competitive position. Are you better or worse than
other companies? Where are you better or worse?
• Make a Stop/Go decision before launch. Is the design good
enough to release to an unsuspecting world?
• Create bonus plans for design managers and higher-level
executives. For example, you can determine bonus amounts for
development project leaders based on how many customer-
support calls or emails their products generated during the year.
7. Task load / mental effort
• Subjective Mental Effort Questionnaire – SMEQ (Sauro, 2009) with 1
item measuring task difficulty.
• NASA’s Task Load Index – NASA-TLX (1980) with 6 items: mental
demand, physical demand, temporal demand, performance,
effort, and frustration.
8. 1. SMEQ - Subjective Mental Effort
Questionnaire
• Post-task rating of difficulty in a
usability test.
• Measures user satisfaction
immediately after the event,
usually the completion of a
task, potentially increasing its
validity.
• The question is repeated for
each task, e.g., 7 tasks – 7 sets of
questions.
9. Scales
• The more scale steps in a questionnaire
item the better, but with rapidly diminishing
returns.
• From 2 to 20 scale steps, there is an initial
rapid increase in reliability, but it tends to
level off at about 7 steps.
• After 11 steps there is little gain in reliability
from increasing the number of steps. The
number of steps is important for single-item
assessments, but is usually less important
when summing scores over a number of
items.
• Attitude scales tend to be highly reliable
because the items typically correlate rather
highly with one another.
10. 2. NASA’s Task Load Index (1980)
• TLX is a subjective workload assessment tool
that allows users to perform subjective workload
assessments of operators working with
various human-machine interface systems.
• The overall workload score is based on a weighted
average of ratings on six subscales: mental
demand, physical demand, temporal
demand, performance, effort, and frustration.
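The weighted scoring described above can be sketched in a few lines of Python. In the classic procedure each subscale is rated on a 0–100 scale and weighted by how often it was chosen in the 15 pairwise comparisons; the subscale names and example numbers below are invented for illustration.

```python
def tlx_overall(ratings, weights):
    """Weighted NASA-TLX overall workload score.

    ratings: subscale -> rating on the 0-100 scale
    weights: subscale -> times the subscale was picked in the
             15 pairwise comparisons (weights sum to 15)
    """
    total_weight = sum(weights.values())
    return sum(ratings[s] * weights[s] for s in ratings) / total_weight

# Hypothetical single-participant data (invented numbers):
ratings = {"mental": 70, "physical": 20, "temporal": 55,
           "performance": 40, "effort": 60, "frustration": 35}
weights = {"mental": 5, "physical": 1, "temporal": 3,
           "performance": 2, "effort": 3, "frustration": 1}
print(tlx_overall(ratings, weights))  # overall workload on the 0-100 scale
```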
11. Definition
1. Mental Demand (low/high)
• How much mental and perceptual activity was required
(e.g., thinking, deciding, remembering, looking,
searching)?
2. Physical Demand (low/high)
• How much physical activity was required (for example,
pushing, pulling, turning, controlling, activating)?
12. Definition
3. Temporal Demand (low/high)
• How much time pressure did you feel due
to the rate or pace at which the tasks or
task elements occurred? Was the pace
slow and leisurely or rapid and frantic?
4. Performance
• How successful do you think you were in
accomplishing the goals of the task set
by the experimenter?
5. Effort
• How hard did you have to work (mentally and
physically) to accomplish your level of
performance?
6. Frustration level
• How insecure, discouraged, irritated, stressed, and
annoyed versus secure, gratified, content,
relaxed, and complacent did you feel during
the task?
14. 3. System Usability Scale (SUS)
• The System Usability Scale (SUS) was developed by John Brooke at Digital
Equipment Corporation in the UK in 1986 as a tool to be used in the
usability engineering of electronic office systems.
• Usability is defined by ISO 9241 Part 11 in terms of the context of use of the system.
• The scale runs 0–100. It can be used to compare even systems that
are outwardly dissimilar.
15. SUS
Strongly disagree Strongly agree
1 2 3 4 5
1. I think that I would like to use this system frequently.
2. I found the system unnecessarily complex.
3. I thought the system was easy to use.
4. I think that I would need the support of a technical person
to be able to use this system.
5. I found the various functions in this system were well
integrated.
6. I thought there was too much inconsistency in this
system.
7. I would imagine that most people would learn to use this
system very quickly.
8. I found the system very cumbersome to use.
9. I felt very confident using the system.
10. I needed to learn a lot of things before I could get going
with this system.
(Slide annotation: each SUS item is tagged with the usability dimension it measures – Learnability, Efficiency, or Satisfaction.)
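The standard SUS scoring procedure (odd-numbered items are positively worded, even-numbered items negatively worded) can be sketched in Python; the example responses are invented.

```python
def sus_score(responses):
    """Compute a 0-100 SUS score from the ten 1-5 item responses.

    Odd items (positive wording): score = response - 1.
    Even items (negative wording): score = 5 - response.
    The summed item scores (0-40) are multiplied by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# One participant: 4 on every odd item, 2 on every even item.
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # -> 75.0
```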
16. 4. After-Scenario Questionnaire (ASQ)
• Lewis, 2002
• Psychometric evaluation
• Measures user satisfaction with a Likert scale
17. ASQ
Strongly disagree Strongly agree
1 2 3 4 5
1. Overall, I am satisfied with the ease of completing the
tasks in this scenario.
2. Overall, I am satisfied with the amount of time it took to
complete the tasks in this scenario.
3. Overall, I am satisfied with the support information
(online help, messages, documentation) when completing the
tasks.
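The ASQ score for a scenario is commonly reported as the arithmetic mean of the three item ratings; a minimal sketch (example ratings are invented, using the 1–5 scale shown on the slide):

```python
def asq_score(ease, time, support):
    """ASQ scenario score: the mean of the three item ratings."""
    return (ease + time + support) / 3

print(asq_score(4, 3, 5))  # -> 4.0
```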
18. 5. Net Promoter Score (NPS)
• The percentage of customers rating their likelihood to recommend a
company, a product, or a service to a friend or colleague as 9 or 10.
• Those who respond with a score of 9 or 10 are called Promoters.
They are considered likely to exhibit value-creating behaviors such
as buying more, remaining customers for longer, and making more
positive referrals to other potential customers.
• Those who respond with a score of 0 to 6 are called Detractors.
They are believed to be less likely to exhibit the value-creating behaviors.
19. NPS
• How likely is it that you would recommend our company/product/
service to a friend or colleague?
0 1 2 3 4 5 6 7 8 9 10
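NPS is the percentage of Promoters minus the percentage of Detractors; a minimal Python sketch (the example scores are invented):

```python
def nps(scores):
    """Net Promoter Score: % Promoters (9-10) minus % Detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

# 4 Promoters, 3 Passives (7-8), 3 Detractors:
scores = [10, 9, 9, 8, 7, 7, 6, 5, 3, 10]
print(nps(scores))  # -> 10.0
```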
20. 6. Technology/Acceptance Model -
TAM (1986)
• Perceived usefulness (PU) - "the degree to which a person believes
that using a particular system would enhance his or her job
performance". It means whether or not someone perceives that
technology to be useful for what they want to do.
• Perceived ease of use (PEOU) - "the degree to which a person
believes that using a particular system would be free from
effort" (Davis, 1989). If the technology is easy to use, then the
barriers are conquered. If it is not easy to use and the interface is
complicated, no one has a positive attitude towards it.
21. TAM and TRA model
• TAM posits that our beliefs
about ease and usefulness
affect our attitude toward
using, which in turn affects
our intention and actual
use.
TAM: External Variables → Perceived usefulness (U) / Perceived ease of use (E) → Attitude toward using (A) → Behavioral intention to use (BI) → Actual system use
TRA: Beliefs and evaluation → Attitude toward behavior (A); Normative beliefs and motivation to comply → Subjective norm; both → Behavioral intention (BI) → Actual behavior
23. 7. User Experience Questionnaire - UEQ
• Laugwitz et al., 2008, with 6 scales (26 items).
• Efficiency: I can perform my tasks with the product fast, efficiently,
and in a pragmatic way.
• Perspicuity: The product should be easy to understand, clear,
simple, and easy to learn.
• Dependability: The interaction with the product should be
predictable, secure, and meet my expectations.
UT
24. 8. User Experience Questionnaire -
UEQ
• Stimulation: Using the product should be interesting, exciting,
and motivating.
• Attractiveness: The product should look attractive, enjoyable,
friendly and pleasant.
• Novelty: The product should be innovative, inventive and
creatively designed.
UX
26. How to use the Excel tool?
• Enter the data in the corresponding worksheet in the Excel file
UEQ_Data_Analysis_Tool_Version<x>.xlsx; all relevant
computations (with the exception of significance tests) are then
done automatically.
• To compare two products, use the Excel file
UEQ_Compare_Products_Version<x>.xlsx.
27. How to interpret the data?
Error Bar
• The error bar describes the interval in which 95% of the scale means
would be located if the measurement were repeated. Thus, it shows how
accurate your measurement is.
• The size of the error bar depends on the sample size (the more
participants you have, the smaller the error bar typically is).
• The error bar also shows how much the participants agree (the higher the
level of agreement, i.e. the more similar the answers are, the smaller
the error bar).
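A rough sketch of how such a 95% interval for a scale mean can be computed from the sample mean and standard deviation, using the normal approximation (the UEQ tool's exact computation may differ; the example ratings are invented):

```python
from math import sqrt
from statistics import mean, stdev

def ci95(values):
    """Approximate 95% confidence interval for a scale mean
    (normal approximation, z = 1.96; reasonable for larger samples)."""
    m = mean(values)
    half = 1.96 * stdev(values) / sqrt(len(values))
    return m - half, m + half

# Hypothetical per-participant scale means on the UEQ's -3..+3 scale:
ratings = [1.2, 0.8, 1.5, 0.3, 1.1, 0.9, 1.4, 0.6]
lo, hi = ci95(ratings)
print(round(lo, 2), round(hi, 2))
```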
28. How to interpret the data?
Cronbach-Alpha values
• A measure of the consistency of a scale, i.e. it indicates whether all
items in a scale measure a similar construct.
• Rules of thumb consider values >0.6 or >0.7 a sufficient level.
• The alpha coefficient is quite sensitive to sampling effects. A low alpha
value can be the result of a sampling effect and may not necessarily
indicate a problem with scale consistency.
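Cronbach's alpha can be computed from the item variances and the variance of the participants' total scores; a minimal sketch in pure Python (the example data are invented):

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for k items, each a list with one score per
    participant (same participant order in every item)."""
    k = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    sum_item_var = sum(pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - sum_item_var / pvariance(totals))

# Three perfectly correlated items give alpha of 1.0:
items = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]
print(cronbach_alpha(items))
```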
29. 9. Software Usability Measurement
Inventory - SUMI (1990)
• Measures users’ satisfaction (user experience).
• SUMI defined user experience with work-based software products in
1995.
• SUMI uses a rigorous scientific method of analysis and is backed up
by over 25 years of industrial application.
• SUMI is under a copyright license - students need to apply by filling in the
online form http://sumi.uxp.ie/about/appform.php
30. Software Usability Measurement - SUMI
• Set verifiable goals of user experience
• Track achievement of targets during product development
• Highlight good and bad aspects of an interface
• http://sumi.uxp.ie
31. How many respondents are required?
• A minimum of 20; as few as 12 respondents can work.
• Get as many respondents as you can within your timeframe and
budget.
32. What are the measurement variables?
• Efficiency - users do their tasks in a quick, effective, and economical manner.
• Affect - the user’s general emotional reaction to the software.
• Helpfulness - the software communicates in a helpful way and assists in the
resolution of operational problems.
• Control - the software responds in an expected and consistent way to inputs and commands.
• Learnability - users can become familiar with the software. The tutorial
interface is reliable and instructive.
Link of questionnaire -> http://sumi.uxp.ie/en/
36. SUMI items by percentile
Item 20: I prefer to stick to the functions that I know best.
Percentile: 88 Verdict: More Agreement
—————
Item 12: Working with this software is satisfying.
Percentile: 58 Verdict: No difference
—————
Item 8: I find that the help information given by this software is
not very useful.
Percentile: 39 Verdict: More Disagreement

Items above the 60th percentile indicate that your respondents gave a
more positive response to that item than expected from the
standardisation database. These items are given in a black
colour.
Items between the 60th and 40th percentiles indicate that
the responses your respondents gave are pretty much in
line with the standardisation database: no surprises here.
These items are given in a blue colour.
Items below the 40th percentile indicate that your
respondents gave a more negative response to that item
than expected from the standardisation database. These
items are given in a red colour. To interpret them, say to
yourself "Respondents agree it is NOT true that [item
wording]."
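The percentile rules above can be captured in a small helper. This is a sketch following the slide's 60th/40th cut-offs; how exact boundary values are handled is an assumption.

```python
def sumi_verdict(percentile):
    """Classify a SUMI item percentile against the standardisation
    database, following the slide's thresholds."""
    if percentile > 60:
        return "More Agreement"      # more positive than expected (black)
    if percentile >= 40:
        return "No difference"       # in line with the database (blue)
    return "More Disagreement"       # more negative than expected (red)

# The three example items from the slide:
for p in (88, 58, 39):
    print(p, sumi_verdict(p))
```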
37. User Records
Participant Global Efficiency Affect Helpfulness Control Learnability
1 70 67 59 65 74 69
2 67 67 58 65 68 69
3 66 62 54 64 69 55
4 65 58 57 73 51 49
5 64 62 54 51 61 57
6 60 64 58 55 64 51
7 59 65 64 53 69 68
8 56 65 63 53 63 68
9 53 67 64 43 62 69
10 52 61 58 41 69 70
11 47 61 58 41 56 63
12 47 60 54 45 44 63
13 46 62 63 43 52 68
14 43 43 45 43 59 49
15 38 55 59 34 49 60
Participants are arranged in the order of their Global scores, with the highest Global scores at the top of the table.
38. 10. PrEmo : Measure Consumer
Emotion & Product Experience
A unique, scientifically validated tool to instantly gain insight into
consumer emotions! People can report their emotions with
expressive cartoon animations instead of relying on the use of words.
https://www.premotool.com
40. PrEmo intro and app
• https://youtu.be/yT2iciPYI0U
• https://youtu.be/6pu09rTehjs
41. 11. Trust in Automated Systems
• Trust can affect how much people accept and rely upon increasingly
automated systems (Sheridan, 1988).
• General trust - trustworthy, honesty, loyalty, reliability, honor
• Trust between people - trustworthy, honesty, loyalty, reliability, integrity
• Trust between human and automated system - trustworthy, loyalty,
reliability, honor
Source: Jiun-Yin Jian
42. Trust
1 2 3 4 5 6 7
1.The system is deceptive.
2. The system behaves in an underhanded manner.
3. I am suspicious of the system’s intent, action or output.
4. I am wary of the system.
5. The system’s actions will have a harmful or injurious outcome.
6. I am confident in the system.
7. The system provides security.
8. The system has integrity.
9. The system is dependable.
10. The system is reliable.
11. I can trust the system.
12. I am familiar with the system.
43. Summary
• 11 questionnaires are covered in this presentation.
• Metrics have developed from task load (human factors) to performance
(usability testing) and users’ emotion (user experience).
• There are many questionnaires on the market. As a result, the
validity of a questionnaire is the crucial part.