Formal Arguments, Preferences, and Natural Language Interfaces to Humans: an Empirical Evaluation

Federico Cerutti, Nava Tintarev, Nir Oren
ECAI 2014, Friday 22nd August 2014

Motivation
– Distributed autonomous systems are increasingly used
– Reasoning can be formalized as argumentation
– However, if we need to explain this reasoning to people, the presentation of information needs to be more natural
– Can we create a bridge between natural language and formal argumentation?
– What kind of factors need to be considered?
- Preferences between arguments?
- Domain-specific knowledge?

Background
The Experiment
Methodology
Results
Conclusions

Background on P&S (Prakken and Sartor)
Rule-based argumentation framework
Allows arguments in favour of preferences among rules to be expressed
Includes negation as failure and strong negation
Although it is pre-Dung (1995), it is easy to draw a correspondence with an abstract argumentation framework (there are some points where we should be cautious, but they do not arise in this work)

Crash course on P&S
Each rule has a set of antecedents and a consequent
Rules are either strict (they cannot contain negation-as-failure atoms) or defeasible
Arguments are sequences of rules (instead of a recursive structure as in ASPIC)
The conclusions of an argument are the set containing the consequent of each rule of the argument
Attacks:
- on some antecedent of some rule
- on some conclusion
Skeptical semantics: grounded
Credulous semantics: stable
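
The rule and argument structure described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the `Rule` class, the string encoding of literals (`~x` for negation as failure, `-x` for strong negation) and the `is_well_formed` check are assumptions made for the example. In particular, the well-formedness condition (every antecedent that is not a negation-as-failure atom must be the consequent of an earlier rule) is assumed here rather than stated on the slide. The rule names `s1` and `r1` are borrowed from the deck's political-debate example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    antecedents: FrozenSet[str]  # "~x" = negation as failure, "-x" = strong negation
    consequent: str
    strict: bool = False

    def __post_init__(self) -> None:
        # Strict rules cannot contain negation-as-failure atoms.
        if self.strict and any(a.startswith("~") for a in self.antecedents):
            raise ValueError(f"strict rule {self.name} contains negation as failure")


# An argument is a sequence of rules, not a recursive structure.
Argument = Tuple[Rule, ...]


def conclusions(argument: Argument) -> FrozenSet[str]:
    """Set containing the consequent of each rule of the argument."""
    return frozenset(rule.consequent for rule in argument)


def is_well_formed(argument: Argument) -> bool:
    """Assumed condition: each antecedent that is not a negation-as-failure
    atom must be the consequent of an earlier rule in the sequence."""
    derived = set()
    for rule in argument:
        needed = {a for a in rule.antecedents if not a.startswith("~")}
        if not needed <= derived:
            return False
        derived.add(rule.consequent)
    return True


s1 = Rule("s1", frozenset(), "s_AAA")
r1 = Rule("r1", frozenset({"s_AAA", "~ex_AAA"}), "poorer")
a1 = (s1, r1)

print(sorted(conclusions(a1)))
print(is_well_formed(a1), is_well_formed((r1,)))
```

Representing an argument as a plain tuple keeps the "sequence of rules" reading explicit; a recursive tree, as in ASPIC, would need a different data structure.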

Example
S D
s1 : ⇒ s_AAA
s2 : ⇒ s_BBB
s3 : ⇒ s_doc
r1 : s_AAA ∧ ∼ex_AAA ⇒ poorer
r2 : s_BBB ∧ s_doc ∧ ∼ex_BBB ∧ ∼ex_doc ⇒ ¬poorer
r3 : ∼ex_expert ⇒ r1 ≺ r2
A politician and an economist discuss the potential financial outcome of the
independence of a region X. The politician puts forward an argument in favour of
the conclusion “If Region X becomes independent, X’s citizens will be poorer
than they are now”. Another argument holding a contradicting conclusion (i.e.
Region X will not be poorer) is advanced by the economist. The economist’s
opinion is likely to be preferred to that of the politician, and is supported by a
scientific document.
Args = {a1 = ⟨s1, r1⟩, a2 = ⟨s2, s3, r2⟩, a3 = ⟨r3⟩}; a2 Args-defeats a1
a2 is justified
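
To see why a2 comes out as justified under the skeptical (grounded) semantics, the example can be read as an abstract argumentation framework whose only defeat is a2 against a1. The sketch below assumes that reading (the slides say such a correspondence is easy to draw) and computes the grounded extension as the least fixed point of the "defended by" operator. The function name and the encoding of defeats as pairs are choices made for this illustration.

```python
def grounded_extension(arguments, defeats):
    """Iterate the characteristic function from the empty set until it
    stabilises. `defeats` is a set of (attacker, target) pairs."""
    extension = set()
    while True:
        defended = {
            a
            for a in arguments
            # every attacker b of a must itself be defeated by the extension
            if all(
                any((d, b) in defeats for d in extension)
                for (b, target) in defeats
                if target == a
            )
        }
        if defended == extension:
            return extension
        extension = defended


base_arguments = {"a1", "a2", "a3"}
base_defeats = {("a2", "a1")}  # a2 defeats a1 thanks to the preference argument a3

print(sorted(grounded_extension(base_arguments, base_defeats)))
```

Under these assumptions a2 and a3 are in the grounded extension and a1 is not, which matches the slide's statement that a2 is justified.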

The Experiment
Each participant is presented with a text, written in natural language, followed by a questionnaire
Between-subjects design across eight texts: each participant is shown a single (randomly selected) text
Four domains:
1 weather forecast
2 political debate
3 used car sale
4 romantic relationship
Two KBs: base case and extended case
The base case always considers two arguments a1 and a2 with contradicting conclusions, and a preference in favour of a2

The Extended Case for the Example
More recent research disputes the claim of the economist
S D
s1 : ⇒ s_AAA
s2 : ⇒ s_BBB
s3 : ⇒ s_doc
s4 : ⇒ s_research
s5 : s_research ⇒ ¬s_doc
r1 : s_AAA ∧ ∼ex_AAA ⇒ poorer
r2 : s_BBB ∧ s_doc ∧ ∼ex_BBB ∧ ∼ex_doc ⇒ ¬poorer
r3 : ∼ex_expert ⇒ r1 ≺ r2
Args = {a1 = ⟨s1, r1⟩, a2 = ⟨s2, s3, r2⟩, a3 = ⟨r3⟩, a4 = ⟨s4, s5⟩}
a2 Args-defeats a1, a2 Args-defeats a4, a4 Args-defeats a2
Two stable extensions:
{a1, a3, a4} and {a2, a3}
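
The two stable extensions can be checked by brute force if the listed defeats are taken as the attack relation of an abstract argumentation framework. The sketch below assumes exactly that reading. It enumerates every subset of arguments and keeps those that are conflict-free and defeat every argument outside the subset. The function name and the pair encoding are illustrative.

```python
from itertools import combinations


def stable_extensions(arguments, defeats):
    """Return all conflict-free sets that defeat every outside argument."""
    args = sorted(arguments)
    found = []
    for size in range(len(args) + 1):
        for subset in combinations(args, size):
            chosen = set(subset)
            conflict_free = not any(
                (a, b) in defeats for a in chosen for b in chosen
            )
            defeats_rest = all(
                any((a, b) in defeats for a in chosen)
                for b in set(args) - chosen
            )
            if conflict_free and defeats_rest:
                found.append(chosen)
    return found


extended_arguments = {"a1", "a2", "a3", "a4"}
extended_defeats = {("a2", "a1"), ("a2", "a4"), ("a4", "a2")}

for extension in stable_extensions(extended_arguments, extended_defeats):
    print(sorted(extension))
```

Exhaustive enumeration is exponential in the number of arguments, which is harmless for four arguments but would not scale to large frameworks.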

Domain 1: weather forecast
The weather forecasting service of the broadcasting company AAA says
that it will rain tomorrow (a1).
Meanwhile, the forecast service of the broadcasting company BBB says that
it will be cloudy tomorrow but that it will not rain (a2).
It is also well known that the forecasting service of BBB is more accurate
than the one of AAA (a3).
However, yesterday the trustworthy newspaper CCC published an article
which said that BBB has cut the resources for its weather forecasting
service in the past months, thus making it less reliable than in the past (a4).

Domain 2: political debate
In a TV debate, the politician AAA argues that if Region X becomes
independent then X’s citizens will be poorer than now (a1).
Subsequently, financial expert (a3) Dr. BBB presents a document, which
scientifically shows that Region X will not be worse off financially if it
becomes independent (a2).
After that, the moderator of the debate reminds BBB of more recent
research by several important economists that disputes the claims in that
document (a4).

Domain 3: buying a car
You are planning to buy a second-hand car, and you go to a dealership with
BBB, a mechanic who has been recommended to you by a friend (a3).
The salesperson AAA shows you a car and says that it needs very little
work done to it (a1).
BBB says it will require quite a lot of work, because in the past he had to
fix several issues in a car of the same model (a2).
While you are at the dealership, your friend calls you to tell you that he
knows (beyond a shadow of a doubt) that BBB made unnecessary repairs
to his car last month (a4).

Domain 4: romance
After several dates, you would like to start a serious relationship with J,
but first you ask two friends of yours, AAA and BBB, for advice. You
have known BBB for longer than you have known AAA (a3).
AAA tells you that J is lovely and you should go ahead (a1),
while BBB suggests that you should be very cautious because J might have
a hidden agenda (a2).
After some weeks, CCC, who is also a close friend of BBB, tells you that
BBB has been into you for years; BBB is too shy to tell you about their
feelings for you, but is still possessive of you (a4).

Formalisation summary
Domain           Base Case   Extended Case   Type of reinstatement
1, weather       1.B         1.E             preference attack
2, politics      2.B         2.E             a2 rebuttal
3, buying car    3.B         3.E             preference attack
4, romance       4.B         4.E             preference rebuttal

Methodology
Participants are asked to determine which of the following positions they think is accurate:
A: I think that AAA’s position is correct (e.g. “X’s citizens will be poorer than now”)
B: I think that BBB’s position is correct (e.g. “X’s citizens will not be worse off financially”)
U: I cannot determine if either AAA’s or BBB’s position is correct (e.g. “I cannot conclude anything about Region X’s finances”)
Participants also rate each statement for relevance (to the conclusion) and agreement, on a 7-point scale from Disagree to Agree

Hypotheses
H1: In the base cases (Scenarios 1.B, 2.B, 3.B and 4.B), the majority of
participants will agree with BBB’s statement (position B)
H2: In the extended cases (Scenarios 1.E, 2.E, 3.E and 4.E), the
majority of participants will agree that they cannot conclude
anything from the text (position U).
H3: The majority of participants who view a base case scenario will
agree with the preference argument, and find it relevant.

Hypotheses H1 and H2
[Figure: Distribution of acceptability of actors’ positions. Bar chart of the percentage of participants choosing A, B or U, for base cases and extended cases.]
Distribution of the final conclusion A/B/U:
Base cases: χ²(2, N = 77) = 37.74, p < 0.001
Extended cases: χ²(2, N = 84) = 8.0, p < 0.02
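
The two χ² values can be reproduced with a short calculation. The slides do not give raw counts, so the counts below are a reconstruction: they are the only integers consistent with the per-scenario percentages in the post hoc table and with the reported totals (N = 77 and N = 84). The sketch also assumes a goodness-of-fit test against equal expected frequencies for A, B and U. That assumption is not stated in the deck, but it yields the reported statistics. For two degrees of freedom the χ² survival function has the closed form exp(−x/2), so no statistics library is needed.

```python
import math

# (A, B, U) counts per scenario: reconstructed from the reported
# percentages, NOT taken directly from the slides.
BASE = {"1.B": (1, 10, 9), "2.B": (1, 12, 6), "3.B": (0, 15, 7), "4.B": (2, 11, 3)}
EXTENDED = {"1.E": (3, 4, 12), "2.E": (4, 2, 13), "3.E": (5, 5, 11), "4.E": (12, 9, 4)}


def pooled(counts_by_scenario):
    """Sum the A, B and U counts over all scenarios."""
    return [sum(column) for column in zip(*counts_by_scenario.values())]


def chi_square_uniform(observed):
    """Goodness-of-fit statistic against equal expected frequencies."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


def p_value_df2(statistic):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-statistic / 2)


for label, table in (("base", BASE), ("extended", EXTENDED)):
    counts = pooled(table)
    stat = chi_square_uniform(counts)
    print(label, counts, round(stat, 2), p_value_df2(stat))
```

With these reconstructed counts the pooled distributions are 4/48/25 (base) and 24/20/40 (extended) for A/B/U.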

Hypothesis H3
Participants rate how much (on a scale of 1 to 7) they agree with the
following statement (agreement), and whether it is relevant in drawing
their conclusion (relevance): “BBB is more trustworthy than AAA.”
Significant difference between the base and the extended cases for
agreement (Mann-Whitney U = 1778, Z = −5.0, p < 0.001) and relevance
(Mann-Whitney U = 1852, Z = −4.7, p < 0.001).
In addition, the median values for both agreement and relevance are
greater for the base cases than for the extended cases.
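
The link between the reported U and Z values can be illustrated with the usual normal approximation for the Mann-Whitney test. The sketch assumes group sizes of 77 (base cases) and 84 (extended cases), taken from the N values reported for the χ² analysis, and applies no correction for ties. Ratings on a 7-point scale contain many ties, and a tie correction shrinks the standard deviation, so the uncorrected values below are expected to be slightly smaller in magnitude than the reported ones.

```python
import math


def mann_whitney_z(u, n1, n2):
    """Normal approximation of the Mann-Whitney U statistic, without tie correction."""
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / sd_u


N_BASE, N_EXTENDED = 77, 84  # assumed group sizes, see lead-in

z_agreement = mann_whitney_z(1778, N_BASE, N_EXTENDED)
z_relevance = mann_whitney_z(1852, N_BASE, N_EXTENDED)
print(round(z_agreement, 2), round(z_relevance, 2))
```

Both approximations land within a tenth of the reported Z = −5.0 and Z = −4.7, which is consistent with the assumed group sizes.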

Post Hoc: Motivations
                 Base Cases (%)          Extended Cases (%)
Domain           A      B      U         A      B      U
1, weather       5.0    50.0   45.0      15.8   21.1   63.2
2, politics      5.3    63.2   31.6      21.1   10.5   68.4
3, buying car    0.0    68.2   31.8      23.8   23.8   52.4
4, romance       12.5   68.8   18.8      48.0   36.0   16.0
Distribution of the final conclusion A/B/U
Fisher (N = 161) = 48.756, p < 0.001, 10000 sampled tables, Monte Carlo approach with 99% confidence interval (MC99)

Post Hoc: Distributions of Base Cases
[Figure: Distributions of motivations for U (scenarios 1.B and 3.B). Bar chart of the percentage of participants giving motivation U1, U2 or U3 in each scenario.]
Agreement with the U position in scenarios 1.B and 3.B:
U1: lack of information; U2: domain-specific reasons; U3: other

Post Hoc: Distributions between Base/Extended Cases
Is the distribution of choices (among A, B, and U) in the base case
significantly different from the distribution of choices in the
corresponding extended case?
YES for the third domain (3.B and 3.E, buying a car): Fisher
(N = 43) = 10.693, p < 0.001, 10000 sampled tables, MC99.
NO for the first domain (1.B and 1.E, weather forecasts): Fisher
(N = 39) = 3.832, p = 0.187, 10000 sampled tables, MC99.

Post Hoc: Distributions Extended Cases
Domain has a significant effect on the distribution of positions: Fisher
(N = 84) = 16.308, p < 0.05, 10000 sampled tables, MC99.

Post Hoc: Relevance and Agreement
                 Base cases          Extended cases
                 R_B†     Md_B*      R_E†     Md_E*      C.D.‡
Relevance
1, weather       110.38   6.00       82.92    4.00       46.60
2, politics      107.45   6.00       69.45    4.00       47.19
3, buying car    118.05   6.50       67.45    4.00       44.38
4, romance       48.34    2.00       44.40    2.00       46.57
Agreement
1, weather       116.38   6.00       87.18    4.00       46.60
2, politics      103.34   6.00       65.05    4.00       47.19
3, buying car    121.93   6.50       64.33    4.00       44.38
4, romance       44.94    2.00       44.20    2.00       46.57
Statistically significant cases when |R_x − R_y| > C.D.
† Mean rank as computed with the Kruskal-Wallis test
* Median
‡ Critical Difference, as computed in [Siegel and Castellan Jr., 1988] cited by [Field, 2009] with α = 0.05.

Post Hoc: Relevance and Agreement
                 Scenario 3.B          Scenario 4.B
                 R_3.B†   Md_3.B*      R_4.B†   Md_4.B*      C.D.‡
Relevance        118.05   6.50         48.34    2.00         47.79
Agreement        121.93   6.50         44.94    2.00         47.79
Statistically significant cases when |R_x − R_y| > C.D.
† Mean rank as computed with the Kruskal-Wallis test
* Median
‡ Critical Difference, as computed in [Siegel and Castellan Jr., 1988] cited by [Field, 2009] with α = 0.05.
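
The critical differences can be approximated with the pairwise-comparison formula usually attributed to Siegel and Castellan: C.D. = z · sqrt(N(N+1)/12 · (1/n_u + 1/n_v)), where z is the normal quantile for an upper tail of α/(k(k−1)) and k is the number of groups. The sketch below is a reconstruction under assumptions, not the authors' computation: it takes k = 8 scenarios, N = 161 participants, and per-scenario group sizes inferred from the reported percentages and totals. Small deviations from the slide values are expected, since a z value read from a printed table is rounded.

```python
import math
from statistics import NormalDist

# Group sizes inferred from the reported percentages; an assumption.
GROUP_SIZE = {"1.B": 20, "2.B": 19, "3.B": 22, "4.B": 16,
              "1.E": 19, "2.E": 19, "3.E": 21, "4.E": 25}
N_TOTAL = sum(GROUP_SIZE.values())  # 161
K_GROUPS = len(GROUP_SIZE)          # 8


def critical_difference(group_u, group_v, alpha=0.05):
    """Critical difference between the mean ranks of two groups."""
    z = NormalDist().inv_cdf(1 - alpha / (K_GROUPS * (K_GROUPS - 1)))
    spread = N_TOTAL * (N_TOTAL + 1) / 12
    pair = 1 / GROUP_SIZE[group_u] + 1 / GROUP_SIZE[group_v]
    return z * math.sqrt(spread * pair)


def is_significant(rank_u, rank_v, group_u, group_v):
    """True when the mean ranks differ by more than the critical difference."""
    return abs(rank_u - rank_v) > critical_difference(group_u, group_v)


# Relevance mean ranks taken from the two tables above.
print(round(critical_difference("3.B", "4.B"), 2))
print(is_significant(118.05, 48.34, "3.B", "4.B"))
print(is_significant(110.38, 82.92, "1.B", "1.E"))
```

Under these assumptions the computed critical differences fall within about 0.1 of the tabulated ones. The 3.B versus 4.B gap in relevance exceeds its critical difference, while the 1.B versus 1.E gap does not.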

Conclusions
Investigation into the relationship between formal systems of defeasible argumentation and arguments in natural language
Results suggest a correspondence between the formal theory and its representation in natural language
Preferences are generally applied “following” Prakken and Sartor: importance of being able to represent them
Humans evaluate preferences depending on the context
Collateral knowledge
Reversal of preference

Acknowledgement
Research was sponsored by US Army Research Laboratory and the UK Ministry
of Defence and was accomplished under Agreement Number W911NF-06-3-0001.
The views and conclusions contained in this document are those of the authors
and should not be interpreted as representing the official policies, either expressed
or implied, of the US Army Research Laboratory, the U.S. Government, the UK
Ministry of Defense, or the UK Government. The US and UK Governments are
authorized to reproduce and distribute reprints for Government purposes
notwithstanding any copyright notation hereon.
This research has been carried out within the project “Scrutable Autonomous
Systems” (SAsSY), funded by the Engineering and Physical Sciences Research
Council (EPSRC, UK), grant ref. EP/J012084/1.

References
[Field, 2009] Field, A. (2009).
Discovering Statistics Using SPSS (Introducing Statistical Methods series).
SAGE Publications Ltd.
[Siegel and Castellan Jr., 1988] Siegel, S. and Castellan Jr., N. J. (1988).
Nonparametric Statistics for The Behavioral Sciences.
McGraw-Hill Humanities/Social Sciences/Languages.