The document presents proScript, a new dataset of 6,400 partially ordered scripts crowdsourced by the authors. It introduces two tasks for modeling scripts: edge prediction and script generation. The authors adapt a pretrained T5 model for both tasks, showing it can generate partial-order scripts. Evaluation shows the model outputs are comparable to human scripts based on graph edit distance and pairwise comparisons. The proScript dataset will be made publicly available to advance research on modeling script knowledge.
2. What is a script? Why is it important?
“A script is a predetermined, stereotyped sequence of actions that defines a well-known situation.”
— Roger Schank and Robert Abelson (1977)
• Scripts are part of commonsense knowledge.
• Scripts help represent and understand the causal structure of events.
• Scripts allow inference about implicit cause-and-effect relationships.
5. Two major approaches for scripts in NLP
1. Script as narrative chains (Mooney and DeJong, 1985; Chambers and Jurafsky, 2008, 2009)
Automatically induce scripts from raw text.
An automatically learned “Prosecution” chain. (Figure from Chambers and Jurafsky, 2008)
[Pros]
• Scalability
[Cons]
• News domain (not everyday scenarios)
• Many reporting verbs (non-core events)
• Highly abstracted as tuples of verb and dependency
• Insufficient evaluation scheme
7. Two major approaches for scripts in NLP
2. Script as paraphrase sets (Regneri et al., 2010; Modi et al., 2016; Wanzare et al., 2016)
1. Ask crowdworkers to write down a sequence of events.
2. Align the collected sequences via paraphrased events (multiple sequence alignment).
3. Cluster the aligned events.
EATING IN A FAST-FOOD RESTAURANT (Figure from Regneri et al., 2010)
[Pros]
• High quality (for everyday scenarios)
[Cons]
• Limited scalability (< 50 scenarios)
• No evaluation metric for modeling
9. Our contributions

                            Quality   Scalability
  Script as narrative chain    -          +
  Script as paraphrase sets    +          -
  proScript                    +          +

1. Crowdsourced 6.4k (partially ordered) scripts.
2. With this data, we adapt pre-trained neural LMs to generate high-quality scripts.
3. Proposed two complementary task definitions with the proScript dataset.
12. Crowdsourcing
1. Collect scenarios of scripts  2. Create partial-order scripts  3. Validate the scripts
ROCStories (Mostafazadeh et al., 2016) → 2,564 scenarios
Manually curated patterns:
- want(ed) to … (e.g., go to Hawaii)
- need(ed) to … (e.g., get a haircut)
- look(ing) to … (e.g., buy a television)
e.g., sign into email account, go to a bathroom, buy some new clothes, replace a closet door, …
DeScript (Wanzare et al., 2016) → 40 scenarios
e.g., take a bath, do laundry, order a pizza, …
VirtualHome (Puig et al., 2018) → 233 scenarios
e.g., turn on light, put mail in mail organizer, put dishes away, …
13. Crowdsourcing
1. Collect scenarios of scripts  2. Create partial-order scripts  3. Validate the scripts
Suppose a scenario where someone wants to “travel to Hawaii”.
Q1: Describe 5 to 7 essential steps and the duration of each. (Note: the order does not matter.)
- decide schedule (1 hour)
- book a flight (30 minutes)
- go to airport (1 hour)
Q2: Create a flowchart of the steps (possibly in partial order, where temporal ordering is required only when it is necessary).
15. Crowdsourcing
1. Collect scenarios of scripts  2. Create partial-order scripts  3. Validate the scripts
Two additional workers are asked to independently do Q2 for the same scenario.
If the two validators' script graphs have low agreement (edge F1), the script is discarded.
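The agreement check can be sketched as an edge-level F1 between the two validators' graphs. This is a minimal sketch; the exact matching criterion and threshold used by the authors may differ:

```python
def edge_f1(edges_a, edges_b):
    """F1 score between two sets of directed edges (pairs of step labels)."""
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 1.0
    tp = len(a & b)                        # edges both validators drew
    precision = tp / len(a) if a else 0.0
    recall = tp / len(b) if b else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two validators order "travel to Hawaii" slightly differently:
g1 = [("decide schedule", "book a flight"), ("book a flight", "go to airport")]
g2 = [("decide schedule", "book a flight"), ("decide schedule", "go to airport")]
score = edge_f1(g1, g2)   # 1 shared edge out of 2 in each -> F1 = 0.5
```

Scripts whose validator graphs fall below an agreement threshold on this kind of score would then be filtered out.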
16. proScript: Dataset statistics
Event durations span many orders of magnitude, e.g., sign into email account (1 min), go to bathroom (5 mins), buy some new clothes (1 hour), replace a closet door (1 day), find a new job (1 month), open a small business (1 year).
[Figure: normalized histogram of event time durations on a natural-log-of-minutes scale]
Degree of the graphs: degree = 1: 67%; degree = 2: 28%; degree >= 3: 5%.
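The duration histogram is plotted on a natural-log-of-minutes scale; a minimal sketch of that conversion (the unit-to-minutes factors are assumptions for illustration, not taken from the paper):

```python
import math

# Rough conversion factors from duration units to minutes (assumed).
MINUTES = {"min": 1, "mins": 1, "minute": 1, "minutes": 1,
           "hour": 60, "hours": 60,
           "day": 1440, "days": 1440,
           "month": 43200, "months": 43200,     # ~30 days
           "year": 525600, "years": 525600}

def log_minutes(duration: str) -> float:
    """Map a duration string like '5 mins' or '1 year' to ln(minutes)."""
    amount, unit = duration.split()
    return math.log(float(amount) * MINUTES[unit.lower()])

log_minutes("1 min")    # 0.0
log_minutes("1 year")   # ~13.2
```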
19. Two task settings
1. proScript Edge Prediction
   Given: scenario and randomly shuffled events
2. proScript Generation
   Given: scenario and the number of events (to generate)
Example — Scenario: bake a cake (number of events: 7)
Events: find the cake recipe; gather the ingredients; turn on the oven; mix the ingredients; put the cake batter in the oven; bake for the right amount of time; take the cake out of the oven
How to represent a DAG?
21. How to represent a DAG structure? — the DOT language
digraph G { A -> B; A -> C; B -> D; C -> D; D -> E; }
The same notation encodes a script, with step descriptions as node labels:
digraph G {
  Step0: find the cake recipe;
  Step1: gather the ingredients;
  Step2: mix the ingredients;
  (… omitted …)
  Step5: bake for the right amount of time;
  Step6: take the cake out of the oven;
  Step0 -> Step1; Step0 -> Step3;
  (… omitted …)
  Step5 -> Step6;
}
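Serializing a script graph into this DOT-style string is straightforward; a minimal sketch (the function name and exact formatting are illustrative, not taken from the paper):

```python
def to_dot(steps, edges):
    """Linearize a script DAG into the DOT-style string shown above.

    steps: list of step descriptions, indexed 0..n-1
    edges: list of (src_index, dst_index) pairs
    """
    parts = [f"Step{i}: {text};" for i, text in enumerate(steps)]
    parts += [f"Step{src} -> Step{dst};" for src, dst in edges]
    return "digraph G { " + " ".join(parts) + " }"

steps = ["find the cake recipe", "gather the ingredients"]
print(to_dot(steps, [(0, 1)]))
# digraph G { Step0: find the cake recipe; Step1: gather the ingredients; Step0 -> Step1; }
```

Because the output is a flat string, a sequence-to-sequence model can emit the whole graph token by token.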
26. Models
1. proScript_gen: T5 (11B) finetuned on proScript data (3.2k scenarios)
2. proScript_transfer: pre-finetuned on WikiHow data (130k), then finetuned on proScript data
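Since T5 is a text-to-text model, both tasks can be cast as (source, target) string pairs, with the DOT-style linearization as the target. A hedged sketch of what the pairs might look like; the exact prompt wording is an assumption, not shown on the slides:

```python
def generation_example(scenario, n_steps, dot_graph):
    """(source, target) pair for script generation:
    scenario + number of events in, DOT-style graph out."""
    source = f"scenario: {scenario}; number of events: {n_steps}"
    return source, dot_graph

def edge_prediction_example(scenario, shuffled_steps, dot_graph):
    """(source, target) pair for edge prediction:
    scenario + shuffled events in, DOT-style graph out."""
    source = f"scenario: {scenario}; events: " + "; ".join(shuffled_steps)
    return source, dot_graph

src, tgt = generation_example("bake a cake", 7, "digraph G { ... }")
# src == "scenario: bake a cake; number of events: 7"
```

Finetuning then proceeds as standard conditional generation over these pairs.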
27. Model outputs (examples) by proScript_gen
Play the organ / Drink a glass of milk / Audition for a musical
How to evaluate these?
29. How to evaluate the generated scripts (DAGs)?
• Absolute evaluation  • Relative evaluation
31. Absolute evaluation: results (the lower the GED, the better)
Graph edit distance, split into vertex-related and edge-related edits (random baseline = 11.3):

                       vertex    edge
  proScript_gen         3.55     1.211
  proScript_transfer    3.54     1.199
  Human                 2.33     0.46

• Random (11.3) >> proScript_gen ≈ proScript_transfer (≈ 4.7) > human (≈ 2.7)
• Edge-related edits outnumber vertex-related edits.
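Graph edit distance counts the vertex and edge insertions/deletions needed to turn a predicted script into the gold one. A simplified sketch that matches steps by exact label (the paper's evaluation matches semantically equivalent steps, which requires paraphrase matching rather than string equality):

```python
def script_edit_distance(pred_nodes, pred_edges, gold_nodes, gold_edges):
    """Count vertex edits and edge edits separately, matching by exact label."""
    pn, gn = set(pred_nodes), set(gold_nodes)
    pe, ge = set(pred_edges), set(gold_edges)
    vertex_edits = len(pn ^ gn)   # nodes present in only one graph
    edge_edits = len(pe ^ ge)     # edges present in only one graph
    return vertex_edits, edge_edits

pred = (["find recipe", "mix", "bake"],
        [("find recipe", "mix"), ("mix", "bake")])
gold = (["find recipe", "mix", "bake", "cool"],
        [("find recipe", "mix"), ("mix", "bake"), ("bake", "cool")])
v, e = script_edit_distance(*pred, *gold)   # v == 1 (missing "cool"), e == 1
```

Reporting vertex and edge edits separately, as in the table above, shows where the model's errors concentrate.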
32. Edit analysis
Edit types (scripts by model): granularity 32%; order ambiguity 32%; incorrect order 16%; irrelevant/redundant event 11%; missing event 5%; human error 5%
Edit types (scripts by human): order ambiguity 33%; granularity 27%; human error 23%; incorrect order 10%; paraphrase 7%
• About 70% of the edits are minor corrections.
• Model-generated scripts require more crucial edits (e.g., incorrect order, missing events) than human-written scripts.
33. Relative evaluation: pairwise comparison

                                 <        =        >
  proScript_gen vs. Human      55.3%   22.7%   22.0%
  proScript_gen vs. transfer   23.8%   45.6%   30.6%
35. Summary
• We collected 6.4k partially ordered scripts, proScript, which is substantially larger than prior script datasets.
• With proScript, we introduced two complementary tasks (edge prediction and script generation) and models for them.
• We show for the first time that pre-trained neural LMs can be adapted to generate partial-order scripts.
• Data will be available at: https://proscript.allenai.org/