proScript: Partially Ordered Scripts Generation
Keisuke Sakaguchi, Chandra Bhagavatula, Ronan Le Bras, Niket Tandon, Peter Clark, Yejin Choi

What is a script? Why is it important?
“a script is a stereotyped sequence of actions that defines a well-known
situation and has associated with it”
Roger Schank and Robert Abelson (1977)
• Part of commonsense knowledge
• Scripts help to represent and understand the causal structure of events.
• Scripts allow inference about implicit cause-and-effect relationships.
Two major approaches for Scripts in NLP
1. Script as narrative chain (Mooney and DeJong, 1985; Chambers and Jurafsky, 2008, 2009)
Automatically induce scripts from raw texts
An automatically learned “Prosecution” chain.
(Figure from Chambers and Jurafsky, 2008)
[Pros]
• Scalability
[Cons]
• news domain (but not everyday scenarios)
• a lot of reporting verbs (non-core events)
• events are highly abstracted as tuples of a verb and its dependency
• evaluation scheme is insufficient
Two major approaches for Scripts in NLP
2. Script as paraphrase sets (Regneri et al., 2010; Modi et al., 2016; Wanzare et al., 2016)
1. Ask crowdworkers to write down a sequence of events.
2. The collected sequences are aligned with paraphrased events.
3. Cluster the aligned events.
Multiple sequence alignment
EATING IN A FAST-FOOD RESTAURANT
(Figure from Regneri et al., 2010)
[Pros]
• High quality (for everyday scenarios)
[Cons]
• Limited scalability (< 50 scenarios)
• No evaluation metric for modeling
Our contributions
                            Quality   Scalability
Script as narrative chain      -          +
Script as paraphrase sets      +          -
proScript                      +          +
1. Crowdsourced 6.4k (partially ordered) scripts.
2. With this data, we adapt pre-trained neural LMs to generate high-quality scripts.
3. Proposed two complementary task definitions with the proScript dataset.
Data Collection
Crowdsourcing
1. Collect scenarios of scripts
2. Create partial order scripts
3. Validate the scripts
ROCStories (Mostafazadeh et al., 2016) → 2,564 scenarios
Manually curated patterns:
- want(ed) to ... (e.g., go to Hawaii),
- need(ed) to ... (e.g., get a haircut),
- look(ing) to ... (e.g., buy a television).
sign into email account, go to a bathroom, buy some new clothes, replace a closet door, …
DeScript (Wanzare et al., 2016) → 40 scenarios
take a bath, do laundry, order a pizza, …
VirtualHome (Puig et al., 2018) → 233 scenarios
turn on light, put mail in mail organizer, put dishes away, …
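The pattern-based collection from ROCStories could be approximated with a regular expression. The sketch below is only an illustration: the exact patterns, the function name `extract_scenario`, and the rule of cutting at the first punctuation mark are assumptions, not the authors' procedure.

```python
import re

# Hypothetical pattern covering "want(ed) to ...", "need(ed) to ...", "look(ing) to ...".
# The scenario is taken to be everything after "to" up to the next punctuation mark.
SCENARIO_PATTERN = re.compile(
    r"\b(?:want(?:ed|s)?|need(?:ed|s)?|look(?:ing|ed|s)?)\s+to\s+([^.,;!?]+)",
    re.IGNORECASE,
)


def extract_scenario(sentence):
    """Return the scenario phrase found in a sentence, or None if no pattern matches."""
    match = SCENARIO_PATTERN.search(sentence)
    return match.group(1).strip() if match else None


examples = [
    "Tom wanted to go to Hawaii.",
    "She needs to get a haircut.",
    "He was looking to buy a television, but the store was closed.",
    "It rained all day.",
]
scenarios = [extract_scenario(s) for s in examples]
```

Sentences without one of the three patterns yield `None`, so they can be filtered out before manual curation.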
Suppose a scenario where someone wants to “travel to Hawaii”.
Q1: Describe 5 to 7 essential steps and the time duration of each. (Note: the order does not matter.)
• decide schedule (1 hour)
• book a flight (30 minutes)
• go to airport (1 hour)
Q2: Create a flowchart of the steps (possibly in partial order, where temporal ordering is required only when it is necessary).
Two different workers are asked to do Q2.
If the script graphs created by the two workers have low agreement (F1), the script is discarded.
[Figure: example edge sets E and Ê of the two graphs]
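The agreement check can be illustrated with a small F1 computation. This is a minimal sketch assuming that E and Ê are the edge sets of the two workers' graphs and that agreement is the F1 of their overlap; the function name `edge_f1` and the example edges are hypothetical, and the slides do not state the discard threshold.

```python
def edge_f1(edges_a, edges_b):
    """F1 agreement between two sets of directed edges (pairs of step names)."""
    a, b = set(edges_a), set(edges_b)
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)  # share of b's edges that also appear in a
    recall = overlap / len(a)     # share of a's edges that also appear in b
    return 2 * precision * recall / (precision + recall)


# Hypothetical graphs from two workers for the "travel to Hawaii" steps.
E = {("decide schedule", "book a flight"), ("book a flight", "go to airport")}
E_hat = {("decide schedule", "book a flight"), ("decide schedule", "go to airport")}

agreement = edge_f1(E, E_hat)  # one shared edge out of two on each side
```

A script would then be kept only if `agreement` reaches whatever threshold the annotation protocol uses.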
proScript: Dataset statistics
[Figure: Normalized histogram of time duration. x-axis: ln m (m = minutes); y-axis: normalized density. Example scenarios: sign into email account (1 min), go to bathroom (5 mins), buy some new clothes (1 hour), replace a closet door (1 day), find a new job (1 month), open a small business (1 year).]
[Figure: Degree of the graphs. Degree = 1: 67%; Degree = 2: 28%; Degree >= 3: 5%.]
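To relate the example durations to the ln m axis of the histogram, the conversion can be worked out directly. The sketch below assumes a 30-day month and a 365-day year, which the slides do not specify.

```python
import math

# Example scenario durations from the slide, converted to minutes.
# Assumptions: 1 month = 30 days, 1 year = 365 days.
DURATION_MINUTES = {
    "sign into email account": 1,
    "go to bathroom": 5,
    "buy some new clothes": 60,
    "replace a closet door": 24 * 60,
    "find a new job": 30 * 24 * 60,
    "open a small business": 365 * 24 * 60,
}

# Position of each scenario on the histogram's x-axis (natural log of minutes).
log_minutes = {name: math.log(m) for name, m in DURATION_MINUTES.items()}
```

Under these assumptions a one-minute task sits at 0 on the axis, a one-hour task at about 4.1, and a one-year task at about 13.2.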
Modeling and Experiments
Two task settings
1. proScript Edge Prediction
Given: Scenario and randomly shuffled events
Scenario: bake a cake
• find the cake recipe
• gather the ingredients
• turn on the oven
• mix the ingredients
• put the cake batter in the oven
• bake for the right amount of time
• take the cake out of the oven
2. proScript Generation
Given: Scenario and the number of events (to generate)
Scenario: bake a cake
Number of events: 7
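The difference between the two task inputs can be made concrete with a short sketch. The dictionary layout and the function names `edge_prediction_input` and `generation_input` are hypothetical; the slides only state what each task is given, not how it is encoded.

```python
import random


def edge_prediction_input(scenario, events, seed=0):
    """Edge prediction: the scenario plus the events in random order."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)  # seeded so the example is reproducible
    return {"scenario": scenario, "events": shuffled}


def generation_input(scenario, num_events):
    """Generation: only the scenario and how many events to produce."""
    return {"scenario": scenario, "num_events": num_events}


events = [
    "find the cake recipe",
    "gather the ingredients",
    "turn on the oven",
    "mix the ingredients",
    "put the cake batter in the oven",
    "bake for the right amount of time",
    "take the cake out of the oven",
]

edge_task = edge_prediction_input("bake a cake", events)
gen_task = generation_input("bake a cake", len(events))
```

In the first task the model only has to order the given events; in the second it has to produce both the events and their ordering.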
How to represent a DAG structure? DOT language.
digraph G { A -> B; A -> C; B -> D; C -> D; D -> E; }
[Figure: the graph drawn from the DOT text above]
digraph G{ Step0: find the cake recipe; Step1: gather the ingredients; Step2: mix the ingredients; (… omitted …) Step5: bake for the right amount of time; Step6: take the cake out of the oven; Step0 -> Step1; Step0 -> Step3; (… omitted …) Step5 -> Step6; }
[Figure: the script graph drawn from the DOT text above]
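A DAG can be flattened into DOT text and read back with a few lines of code. The sketch below is an illustration using only the Python standard library (`graphlib` needs Python 3.9 or later); the helper names `to_dot`, `parse_edges`, and `topological_order` are hypothetical and are not part of the proScript release.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def to_dot(edges, name="G"):
    """Serialize a list of (source, target) edges as one line of DOT text."""
    body = " ".join(f"{src} -> {dst};" for src, dst in edges)
    return f"digraph {name} {{ {body} }}"


def parse_edges(dot_text):
    """Recover (source, target) pairs from DOT text with simple node names."""
    return re.findall(r"(\w+)\s*->\s*(\w+)", dot_text)


def topological_order(edges):
    """Return one valid ordering; raises graphlib.CycleError if there is a cycle."""
    sorter = TopologicalSorter()
    for src, dst in edges:
        sorter.add(dst, src)  # dst can only happen after src
    return list(sorter.static_order())


edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
dot_text = to_dot(edges)
round_trip = parse_edges(dot_text)
order = topological_order(edges)
```

Because the text form is a plain string, it can serve as the output of a sequence-to-sequence model, and the topological sort doubles as a check that a generated graph is acyclic.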
Models
1. proScript_gen
T5 (11B) fine-tuned on proScript data (3.2k scenarios)
2. proScript_transfer
Pre-finetune with WikiHow data (130k) → finetune with proScript data
Model outputs (examples) by proScript_gen
[Figure: generated scripts for three scenarios: Play the organ, Drink a glass of milk, Audition for a musical]
How to evaluate these?
How to evaluate the generated scripts (DAGs)?
• Absolute evaluation
• Relative evaluation
Absolute evaluation: graph edit distance (Abu-Aisheh et al., 2015)
[Figure: a generated DAG next to its edited DAG]
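As a rough illustration of counting edits between a generated DAG and its edited version, the sketch below separates vertex edits from edge edits. It is a simplification, not the algorithm of Abu-Aisheh et al.: it assumes nodes in the two graphs are matched by label and that every insertion or deletion costs 1, whereas general graph edit distance also searches over node correspondences. The function name and example graphs are hypothetical.

```python
def edit_counts(gen_nodes, gen_edges, edited_nodes, edited_edges):
    """Count unit-cost vertex and edge insertions/deletions between two labeled graphs."""
    vertex_edits = len(set(gen_nodes) ^ set(edited_nodes))  # symmetric difference
    edge_edits = len(set(gen_edges) ^ set(edited_edges))
    return vertex_edits, edge_edits


# Hypothetical generated script: a plain chain of four steps.
gen_nodes = ["A", "B", "C", "D"]
gen_edges = [("A", "B"), ("B", "C"), ("C", "D")]

# Hypothetical edited script: one step added, B and C made parallel.
edited_nodes = ["A", "B", "C", "D", "E"]
edited_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]

vertex_edits, edge_edits = edit_counts(gen_nodes, gen_edges, edited_nodes, edited_edges)
total_edits = vertex_edits + edge_edits
```

Keeping the two counts separate mirrors the vertex and edge breakdown reported in the results.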
Absolute evaluation: Result (the lower the GED, the better)
[Figure: bar chart of graph edit distance, split into vertex and edge edits, for proScript_gen, proScript_transfer, and Human; axis from 0 to 5. Bar labels: 2.33, 3.55, 3.54, 0.46, 1.211, 1.199.]
Graph edit distance (random baseline = 11.3)
• Random (11.3) >> proScript_gen = proScript_transfer (4.7) > Human (2.7)
• Edge-related edits > Vertex-related edits
Edit analysis
Edit Types (Scripts by model):
• human error: 5%
• granularity: 32%
• order ambiguity: 32%
• irrelevant/redundant event: 11%
• missing event: 5%
• incorrect order: 16%
Edit Types (Scripts by human):
• human error: 23%
• paraphrase: 7%
• granularity: 27%
• order ambiguity: 33%
• incorrect order: 10%
[Figure annotations: 30% and 10%]
• 70% of edits are minor corrections.
• Scripts generated by proScript need more crucial edits than scripts written by humans.
Relative evaluation: pairwise comparison
[Figure: two pairwise comparisons, proScript_gen vs. Human and proScript_gen vs. proScript_transfer, each split into <, =, and > outcomes. Values in slide order: 55.3%, 22.7%, 22.0% and 23.8%, 45.6%, 30.6%.]
Summary
We collect 6.4k partially ordered scripts, proScript, which is substantially larger than prior datasets.
With proScript, we introduce two complementary tasks (edge prediction and script generation) and models.
We show for the first time that pre-trained neural LMs can be adapted to generate partially ordered scripts.
Data will be available: https://proscript.allenai.org/

EMNLP 2021 proScript
