The document presents proScript, a new dataset of 6,400 partially ordered scripts crowdsourced by the authors. It introduces two tasks for modeling scripts: edge prediction and script generation. The authors adapt a pretrained T5 model for both tasks, showing it can generate partial-order scripts. Evaluation shows the model outputs are comparable to human scripts based on graph edit distance and pairwise comparisons. The proScript dataset will be made publicly available to advance research on modeling script knowledge.
2. What is a script? Why is it important?
“A script is a predetermined, stereotyped sequence of actions that defines a well-known situation.”
— Roger Schank and Robert Abelson (1977)
• Scripts are part of commonsense knowledge.
• Scripts help represent and understand the causal structure of events.
• Scripts allow inference about implicit cause-and-effect relationships.
5. Two major approaches for scripts in NLP
1. Script as narrative chains (Mooney and DeJong, 1985; Chambers and Jurafsky, 2008, 2009)
Automatically induce scripts from raw text.
An automatically learned “Prosecution” chain. (Figure from Chambers and Jurafsky, 2008)
[Pros]
• Scalability
[Cons]
• News domain (not everyday scenarios)
• Many reporting verbs (non-core events)
• Highly abstracted as tuples of verb and dependency
• Insufficient evaluation scheme
7. Two major approaches for scripts in NLP
2. Script as paraphrase sets (Regneri et al., 2010; Modi et al., 2016; Wanzare et al., 2016)
1. Ask crowdworkers to write down a sequence of events.
2. Align the collected sequences via paraphrased events (multiple sequence alignment).
3. Cluster the aligned events.
EATING IN A FAST-FOOD RESTAURANT (Figure from Regneri et al., 2010)
[Pros]
• High quality (for everyday scenarios)
[Cons]
• Limited scalability (< 50 scenarios)
• No evaluation metric for modeling
9. Our contributions

                            Quality   Scalability
  Script as narrative chain    -          +
  Script as paraphrase sets    +          -
  proScript                    +          +

1. Crowdsourced 6.4k (partially ordered) scripts.
2. With this data, we adapt pre-trained neural LMs to generate high-quality scripts.
3. Proposed two complementary task definitions with the proScript dataset.
12. Crowdsourcing
1. Collect scenarios of scripts  2. Create partial-order scripts  3. Validate the scripts
ROCStories (Mostafazadeh et al., 2016) → 2,564 scenarios
Manually curated patterns:
- want(ed) to … (e.g., go to Hawaii)
- need(ed) to … (e.g., get a haircut)
- look(ing) to … (e.g., buy a television)
e.g., sign into email account, go to a bathroom, buy some new clothes, replace a closet door, …
DeScript (Wanzare et al., 2016) → 40 scenarios
e.g., take a bath, do laundry, order a pizza, …
VirtualHome (Puig et al., 2018) → 233 scenarios
e.g., turn on light, put mail in mail organizer, put dishes away, …
13. Crowdsourcing
1. Collect scenarios of scripts  2. Create partial-order scripts  3. Validate the scripts
Suppose a scenario where someone wants to “travel to Hawaii”.
Q1: Describe 5 to 7 essential steps and the duration of each. (Note: the order does not matter.)
- decide schedule (1 hour)
- book a flight (30 minutes)
- go to airport (1 hour)
Q2: Create a flowchart of the steps (possibly in partial order, where temporal ordering is required only when it is necessary).
15. Crowdsourcing
1. Collect scenarios of scripts  2. Create partial-order scripts  3. Validate the scripts
Two additional workers are asked to independently do Q2 for the same scenario.
If the two validators' script graphs have low agreement (edge F1), the script is discarded.
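The agreement check can be sketched as an edge-level F1 between the two validators' graphs. This is a minimal sketch; the exact matching criterion and threshold used by the authors may differ:

```python
def edge_f1(edges_a, edges_b):
    """F1 score between two sets of directed edges (pairs of step labels)."""
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 1.0
    tp = len(a & b)                        # edges both validators drew
    precision = tp / len(a) if a else 0.0
    recall = tp / len(b) if b else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two validators order "travel to Hawaii" slightly differently:
g1 = [("decide schedule", "book a flight"), ("book a flight", "go to airport")]
g2 = [("decide schedule", "book a flight"), ("decide schedule", "go to airport")]
score = edge_f1(g1, g2)   # 1 shared edge out of 2 in each -> F1 = 0.5
```

Scripts whose validator graphs fall below an agreement threshold on this kind of score would then be filtered out.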
16. proScript: Dataset statistics
Event durations span many orders of magnitude, e.g., sign into email account (1 min), go to bathroom (5 mins), buy some new clothes (1 hour), replace a closet door (1 day), find a new job (1 month), open a small business (1 year).
[Figure: normalized histogram of event time durations on a natural-log-of-minutes scale]
Degree of the graphs: degree = 1: 67%; degree = 2: 28%; degree >= 3: 5%.
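The duration histogram is plotted on a natural-log-of-minutes scale; a minimal sketch of that conversion (the unit-to-minutes factors are assumptions for illustration, not taken from the paper):

```python
import math

# Rough conversion factors from duration units to minutes (assumed).
MINUTES = {"min": 1, "mins": 1, "minute": 1, "minutes": 1,
           "hour": 60, "hours": 60,
           "day": 1440, "days": 1440,
           "month": 43200, "months": 43200,     # ~30 days
           "year": 525600, "years": 525600}

def log_minutes(duration: str) -> float:
    """Map a duration string like '5 mins' or '1 year' to ln(minutes)."""
    amount, unit = duration.split()
    return math.log(float(amount) * MINUTES[unit.lower()])

log_minutes("1 min")    # 0.0
log_minutes("1 year")   # ~13.2
```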
19. Two task settings
1. proScript Edge Prediction
   Given: scenario and randomly shuffled events
2. proScript Generation
   Given: scenario and the number of events (to generate)
Example — Scenario: bake a cake (number of events: 7)
Events: find the cake recipe; gather the ingredients; turn on the oven; mix the ingredients; put the cake batter in the oven; bake for the right amount of time; take the cake out of the oven
How to represent a DAG?
21. How to represent a DAG structure? — the DOT language
digraph G { A -> B; A -> C; B -> D; C -> D; D -> E; }
The same notation encodes a script, with step descriptions as node labels:
digraph G {
  Step0: find the cake recipe;
  Step1: gather the ingredients;
  Step2: mix the ingredients;
  (… omitted …)
  Step5: bake for the right amount of time;
  Step6: take the cake out of the oven;
  Step0 -> Step1; Step0 -> Step3;
  (… omitted …)
  Step5 -> Step6;
}
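Serializing a script graph into this DOT-style string is straightforward; a minimal sketch (the function name and exact formatting are illustrative, not taken from the paper):

```python
def to_dot(steps, edges):
    """Linearize a script DAG into the DOT-style string shown above.

    steps: list of step descriptions, indexed 0..n-1
    edges: list of (src_index, dst_index) pairs
    """
    parts = [f"Step{i}: {text};" for i, text in enumerate(steps)]
    parts += [f"Step{src} -> Step{dst};" for src, dst in edges]
    return "digraph G { " + " ".join(parts) + " }"

steps = ["find the cake recipe", "gather the ingredients"]
print(to_dot(steps, [(0, 1)]))
# digraph G { Step0: find the cake recipe; Step1: gather the ingredients; Step0 -> Step1; }
```

Because the output is a flat string, a sequence-to-sequence model can emit the whole graph token by token.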
26. Models
1. proScript_gen: T5 (11B) finetuned on proScript data (3.2k scenarios)
2. proScript_transfer: pre-finetuned on WikiHow data (130k), then finetuned on proScript data
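Since T5 is a text-to-text model, both tasks can be cast as (source, target) string pairs, with the DOT-style linearization as the target. A hedged sketch of what the pairs might look like; the exact prompt wording is an assumption, not shown on the slides:

```python
def generation_example(scenario, n_steps, dot_graph):
    """(source, target) pair for script generation:
    scenario + number of events in, DOT-style graph out."""
    source = f"scenario: {scenario}; number of events: {n_steps}"
    return source, dot_graph

def edge_prediction_example(scenario, shuffled_steps, dot_graph):
    """(source, target) pair for edge prediction:
    scenario + shuffled events in, DOT-style graph out."""
    source = f"scenario: {scenario}; events: " + "; ".join(shuffled_steps)
    return source, dot_graph

src, tgt = generation_example("bake a cake", 7, "digraph G { ... }")
# src == "scenario: bake a cake; number of events: 7"
```

Finetuning then proceeds as standard conditional generation over these pairs.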
27. Model outputs (examples) by proScript_gen
Play the organ / Drink a glass of milk / Audition for a musical
How to evaluate these?
29. How to evaluate the generated scripts (DAGs)?
• Absolute evaluation  • Relative evaluation
31. Absolute evaluation: results (the lower the GED, the better)
Graph edit distance, split into vertex-related and edge-related edits (random baseline = 11.3):

                       vertex    edge
  proScript_gen         3.55     1.211
  proScript_transfer    3.54     1.199
  Human                 2.33     0.46

• Random (11.3) >> proScript_gen ≈ proScript_transfer (≈ 4.7) > human (≈ 2.7)
• Edge-related edits outnumber vertex-related edits.
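Graph edit distance counts the vertex and edge insertions/deletions needed to turn a predicted script into the gold one. A simplified sketch that matches steps by exact label (the paper's evaluation matches semantically equivalent steps, which requires paraphrase matching rather than string equality):

```python
def script_edit_distance(pred_nodes, pred_edges, gold_nodes, gold_edges):
    """Count vertex edits and edge edits separately, matching by exact label."""
    pn, gn = set(pred_nodes), set(gold_nodes)
    pe, ge = set(pred_edges), set(gold_edges)
    vertex_edits = len(pn ^ gn)   # nodes present in only one graph
    edge_edits = len(pe ^ ge)     # edges present in only one graph
    return vertex_edits, edge_edits

pred = (["find recipe", "mix", "bake"],
        [("find recipe", "mix"), ("mix", "bake")])
gold = (["find recipe", "mix", "bake", "cool"],
        [("find recipe", "mix"), ("mix", "bake"), ("bake", "cool")])
v, e = script_edit_distance(*pred, *gold)   # v == 1 (missing "cool"), e == 1
```

Reporting vertex and edge edits separately, as in the table above, shows where the model's errors concentrate.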
32. Edit analysis
Edit types (scripts by model): granularity 32%; order ambiguity 32%; incorrect order 16%; irrelevant/redundant event 11%; missing event 5%; human error 5%
Edit types (scripts by human): order ambiguity 33%; granularity 27%; human error 23%; incorrect order 10%; paraphrase 7%
• About 70% of the edits are minor corrections.
• Model-generated scripts require more crucial edits (e.g., incorrect order, missing events) than human-written scripts.
33. Relative evaluation: pairwise comparison

                                 <        =        >
  proScript_gen vs. Human      55.3%   22.7%   22.0%
  proScript_gen vs. transfer   23.8%   45.6%   30.6%
35. Summary
• We collected 6.4k partially ordered scripts, proScript, which is substantially larger than prior script datasets.
• With proScript, we introduced two complementary tasks (edge prediction and script generation) and models for them.
• We show for the first time that pre-trained neural LMs can be adapted to generate partial-order scripts.
• Data will be available at: https://proscript.allenai.org/