The Eighth Dialog System Technology Challenge (DSTC8)
Seokhwan Kim, Michel Galley, Chulaka Gunasekara, Sungjin Lee, Adam Atkinson, Baolin Peng, Hannes Schulz,
Jianfeng Gao, Jinchao Li, Mahmoud Adada, Minlie Huang, Luis Lastras, Jonathan K. Kummerfeld, Walter S. Lasecki,
Chiori Hori, Anoop Cherian, Tim K. Marks, Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta
• DSTC8 Timeline
• Nov 2018 – Jan 2019: 7 Proposals Received
• Jan 2019: DSTC8 Planning @ DSTC7 Workshop
• Mar 2019: 4 Tracks Selected
• Apr – May 2019: Track Preparation
• Jun – Oct 2019: Challenge Period
• Feb 8, 2020: DSTC8 Workshop @ AAAI-20
• Four Main Tracks
• Multi-domain Task Completion
(Microsoft Research AI & Tsinghua University)
• NOESIS II: Predicting Responses, Identifying Success, and Managing Complexity in Task-Oriented Dialogue
(IBM Research AI & University of Michigan)
• Audio Visual Scene-Aware Dialog
(Mitsubishi Electric Research Laboratories)
• Schema-Guided State Tracking
(Google Research)
• Workshop@AAAI-20
• One-day Workshop
• Feb 8, 2020 in New York
• Online registration will be
open until Jan 10, 2020.
• DSTC9 Planning Session
• Call for DSTC9 Track
Proposals by Jan 25, 2020
• Proposal Presentations @
DSTC8 Workshop
Track 1: Multi-Domain Task-Completion
• Task #1: End-to-End Multi-Domain Task
• Goal: Train an end-to-end dialog system that takes natural language as input and outputs natural language responses
• Data: MultiWOZ 2.0 from Budzianowski et al. (EMNLP 2018)
• 7 domains: attraction, hospital, police, hotel, restaurant, taxi, and train.
• Additional annotations of user dialog acts are also provided.
• Evaluation: E2E performance with both automatic & human evaluation
• Results
• 12 submissions, each using either a conventional pipeline approach or an end-to-end approach.
• Some differences between human and automatic evaluation.
• The winning team (by human evaluation) leverages GPT-2.
• Task #2: Fast Adaptation Task
• Goal: Train an end-to-end dialog system to adapt to a new goal-oriented
domain given very few in-domain sample dialogs
• Data
• Large Reddit dataset of 5 million dialogs over 1000 domains (subreddits).
• MetaLWOz dataset with over 38,000 goal-oriented dialogs, covering 47 diverse
domains divided into 226 tasks.
• Evaluation
• Automatic Evaluation on MultiWOZ
• Human Evaluation on held-out MetaLWOz-domains
• Results
• Four submissions, using transfer learning or fine-tuning with LSTM and Transformer models
• Significant difference between automatic and human evaluation.
• The human evaluation ranking was robust under bootstrapping.
• Winning team (by human evaluation) used hybrid GPT-2 generator and ranker.
Track 2: NOESIS II: Predicting Responses
• Task: Dialogue response selection via next-utterance ranking
• Input: Multi-Party Conversation Context + 100 Candidate Utterances
• Output: Rankings including ‘none’
• Data in two domains: Ubuntu support & student advising.
• Evaluation Metrics
• Recall@N: In what fraction of cases is the true answer in the top N
produced by the system?
• Mean Reciprocal Rank: the mean over all cases of 1/rank of the true answer
• Ubuntu data from Kummerfeld et al. (ACL 2019); Advising data from DSTC7
• Almost all teams used BERT-based approaches
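The two metrics above can be sketched in a few lines of Python (a minimal illustration; the function names are mine, not from the challenge's evaluation toolkit):

```python
def recall_at_n(ranked_lists, truths, n):
    """Fraction of cases where the true answer appears in the system's top n."""
    hits = sum(truth in ranked[:n] for ranked, truth in zip(ranked_lists, truths))
    return hits / len(truths)

def mean_reciprocal_rank(ranked_lists, truths):
    """Mean of 1/rank of the true answer; contributes 0 if it is absent."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)  # ranks are 1-based
    return total / len(truths)
```

In the challenge setting each ranked list would contain up to 100 candidate utterance IDs, with 'none' treated as just another candidate.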
• Results (Ubuntu & Advising): data augmentation methods and balancing positive and negative samples were a crucial part of the best approaches.
DSTC mailing list: list@dstc.community • https://sites.google.com/dstc.community/dstc8
Track 3: Audio Visual Scene-Aware Dialog
• Task
• Build machines that can converse with humans about the objects and events around them, interacting with the real world by understanding dynamic audiovisual scenes
• Input: a video, dialog history about the video, a follow-up question
• Output: Generate a correct response to the follow-up question
• Data
• Video Content: Charades Dataset [Sigurdsson et al. 2016]
• Dialog Collection: AVSD dialogs collected via MTurk
• Two turkers have a dialog, consisting of 10 rounds of Q&A.
• Then the Questioner writes a summary of the video.
• Results
• Received 27 system submissions from 12 teams
• The best system applied a "Universal Multimodal Transformer": a fine-tuned seq-to-seq model with GPT-2 embeddings.
Track 4: Schema-Guided Dialogue State Tracking
• Task
• Develop dialogue state tracking systems suitable for the scale and complexity of large-scale virtual assistants
• Support a wide variety of APIs or services over many domains
• Zero-shot or few-shot generalization to new services
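In the schema-guided setting, each service is described by a schema that pairs its intents and slots with natural-language descriptions, so a model conditioned on descriptions can generalize to unseen services without new parameters. A simplified, illustrative schema (field names approximate the dataset's structure but are not guaranteed to match it exactly) might look like:

```python
# Illustrative service schema (simplified, hypothetical example).
schema = {
    "service_name": "Restaurants_1",
    "description": "A service for finding and booking restaurants",
    "slots": [
        {"name": "cuisine", "description": "Type of food served",
         "is_categorical": True, "possible_values": ["Italian", "Mexican"]},
        {"name": "restaurant_name", "description": "Name of the restaurant",
         "is_categorical": False, "possible_values": []},
    ],
    "intents": [
        {"name": "FindRestaurants",
         "description": "Search for restaurants by cuisine",
         "required_slots": ["cuisine"]},
    ],
}

def slot_descriptions(schema):
    """Map slot names to the descriptions a model would condition on."""
    return {s["name"]: s["description"] for s in schema["slots"]}
```

A new service then only requires writing a new schema, not retraining with service-specific output heads.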
• Data
• Largest public corpus of multi-domain task-oriented conversations
• Contains annotations for spoken language understanding, dialogue state tracking, policy imitation learning, and language generation.
• Evaluation Metrics
• Joint Goal Accuracy, Average Goal Accuracy, Requested Slots F1, Active Intent Accuracy
• Results
• Participation from 25 teams
• Mostly based on large pre-trained models like BERT or RoBERTa
• Good performance without any domain- or service-specific parameters
• Winning team: machine-reading comprehension for non-categorical slots;
wide and deep network for categorical slots; data augmentation by back
translation; additional hand-crafted features
