The Fifth Dialog State Tracking Challenge (DSTC5)
Seokhwan Kim (1), Luis Fernando D’Haro (1), Rafael E. Banchs (1), Jason D. Williams (2), Matthew Henderson (3), Koichiro Yoshino (4)
(1) Institute for Infocomm Research, Singapore. (2) Microsoft Research, USA. (3) Google, USA. (4) Nara Institute of Science and Technology, Japan.
Problems
Goal
Human-human dialogs on tourist information in English and Chinese
Focusing on the problem of adaptation to a new language
Main Task
Dialog State Tracking (DST)
Pilot Tasks
Spoken Language Understanding (SLU)
Speech Act Prediction (SAP)
Spoken Language Generation (SLG)
End-to-end System (EES)
Datasets
Dialogs
Set    Task  Language  # dialogs  # utterances
Train  ALL   English   35         31,304        ← DSTC4 datasets
Dev    ALL   Chinese   2          3,130
Test   MAIN  Chinese   10         14,878
Test   SLU   Chinese   8          12,655
Test   SAP   Chinese   8          11,456
Test   SLG   Chinese   8          12,346
Translations
5-best translations with word alignments were provided for each utterance, generated by English-to-Chinese and Chinese-to-English MT systems
The DSTC4 ontology was provided together with its automatic translation into Chinese
Main Task: Dialog State Tracking
Task Definition
Dialog state tracking at the sub-dialog level
Input
Transcribed utterances from the beginning of the session to each timestep
Manually segmented by sub-dialogs and annotated with topic categories
Output
Frame structures defined with slot-value pairs
For 5 major topic categories: Accommodation, Attraction, Food, Shopping, Transportation
Example
Each sub-dialog is listed with its dialog state, followed by its utterances (Speaker, Utterance).

Dialog State: TOPIC: Attraction; TYPE OF PLACE: Ethnic enclave; NEIGHBORHOOD: Kampong Glam
  Guide    我介绍你这个甘榜格南。 (I recommend you this Kampong Glam.)
  Tourist  对。 (Right.)
  Guide    你看,它是个-它是马来村嘛 (You see, it is a- it’s a Malay Village)
  Tourist  对,甘榜- (Right, Kampong-)

Dialog State: TOPIC: Food; CUISINE: Malay cuisine; NEIGHBORHOOD: Kampong Glam
  Guide    它就卖了很多马来食物。 (It sells a lot of Malay food.)
  Tourist  比较有特色的食物, (It’s quite a unique food,)
  Guide    对,哦。 (Right.)
  Guide    马来食物,基本上,它是香。 (Malay food, basically, it smells very nice.)

Dialog State: TOPIC: Accommodation; INFO: Pricerange; NAME: V Hotel
  Tourist  那我们住宿呢? (Then, where do we stay?)
  Guide    我介绍一间呵,叫V Hotel的。 (Let me recommend to you, the V Hotel.)
  Guide    这个酒店,价格这个不贵。 (This hotel, the price is not expensive.)
  Tourist  好的。 (Okay.)

Dialog State: TOPIC: Transportation; INFO: Duration; TYPE: Walking; FROM: V Hotel; TO: Kampong Glam
  Guide    如果要去,我建议的这个马来文化村, (If you want to go, I suggest this Malay cultural village,)
  Tourist  马来村? (Malay village?)
  Guide    步行大概我看十五分钟吧。 (I think it takes about fifteen minutes on foot.)
  Tourist  好。 (That’s good.)
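The dialog states in the example can be written down as frame structures, one per sub-dialog. The following is a minimal sketch in Python, assuming a plain dict-of-lists layout; the variable name sub_dialog_states, the helper slot_value_pairs, and the layout itself are illustrative assumptions and not the official DSTC5 data format.

```python
# Illustrative only: each sub-dialog carries a topic and a frame that maps
# slot names to lists of values. This layout is an assumption for the sketch,
# not the official DSTC5 file format.
sub_dialog_states = [
    {"topic": "Attraction",
     "frame": {"TYPE OF PLACE": ["Ethnic enclave"],
               "NEIGHBORHOOD": ["Kampong Glam"]}},
    {"topic": "Food",
     "frame": {"CUISINE": ["Malay cuisine"],
               "NEIGHBORHOOD": ["Kampong Glam"]}},
    {"topic": "Accommodation",
     "frame": {"INFO": ["Pricerange"],
               "NAME": ["V Hotel"]}},
    {"topic": "Transportation",
     "frame": {"INFO": ["Duration"],
               "TYPE": ["Walking"],
               "FROM": ["V Hotel"],
               "TO": ["Kampong Glam"]}},
]


def slot_value_pairs(frame):
    """Flatten a frame into a set of (slot, value) pairs."""
    return {(slot, value) for slot, values in frame.items() for value in values}


for state in sub_dialog_states:
    print(state["topic"], sorted(slot_value_pairs(state["frame"])))
```

Storing values as lists leaves room for slots that take several values in one sub-dialog; the flattened pairs are convenient for slot-level scoring.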
Baselines
Fuzzy string matching between ontology entries and utterances (DSTC4)
Baseline 1: Translations in English with the original ontology in English
Baseline 2: Original utterances in Chinese with the translated ontology in Chinese
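As a rough illustration of the fuzzy-matching idea behind both baselines, the sketch below slides a window over the utterance and compares it with each ontology entry using Python's difflib. It is not the official baseline code: the function names, the toy ontology, and the 0.8 threshold are assumptions made for this example.

```python
from difflib import SequenceMatcher


def fuzzy_match_score(entry, utterance):
    """Best similarity ratio between an ontology entry and any window of
    the utterance that has the same number of whitespace tokens."""
    entry_tokens = entry.lower().split()
    utt_tokens = utterance.lower().split()
    n = len(entry_tokens)
    if n == 0 or len(utt_tokens) < n:
        return 0.0
    target = " ".join(entry_tokens)
    best = 0.0
    for i in range(len(utt_tokens) - n + 1):
        window = " ".join(utt_tokens[i:i + n])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def track(utterance, ontology, threshold=0.8):
    """Return {slot: [values]} for ontology entries that fuzzily match.
    The threshold is a hypothetical choice, not a value from the challenge."""
    frame = {}
    for slot, values in ontology.items():
        matched = [v for v in values
                   if fuzzy_match_score(v, utterance) >= threshold]
        if matched:
            frame[slot] = matched
    return frame


# Hypothetical toy ontology, for illustration only.
toy_ontology = {
    "NEIGHBORHOOD": ["Kampong Glam", "Chinatown"],
    "CUISINE": ["Malay cuisine"],
}
print(track("I recommend you this Kampong Glam.", toy_ontology))
```

In the Baseline 1 setting the utterance would be an English translation matched against the English ontology. For Baseline 2, whitespace tokenization does not apply to Chinese text, so the windows would have to be built over characters or segmented words instead.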
Evaluation
Schedules: (1) every turn; (2) only at the end of each sub-dialog
Metrics: (1) Frame-level Accuracy; (2) Slot-level Precision/Recall/F-measure
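The two schedules and two metrics can be sketched as follows. This is a simplified illustration, assuming each frame is represented as a set of (slot, value) pairs and each turn carries a sub-dialog identifier; the function names are hypothetical and the official scoring scripts may differ in detail.

```python
def schedule_indices(segment_ids, schedule):
    """Pick the turns to score.
    Schedule 1: every turn. Schedule 2: only the last turn of each sub-dialog."""
    if schedule == 1:
        return list(range(len(segment_ids)))
    return [i for i, seg in enumerate(segment_ids)
            if i + 1 == len(segment_ids) or segment_ids[i + 1] != seg]


def frame_accuracy(predicted, gold):
    """Fraction of scored turns whose predicted frame equals the gold frame."""
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def slot_prf(predicted, gold):
    """Slot-level precision, recall and F-measure over (slot, value) pairs."""
    tp = n_pred = n_gold = 0
    for p, g in zip(predicted, gold):
        tp += len(p & g)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data: two scored turns, the first prediction misses one slot.
gold = [
    {("TYPE OF PLACE", "Ethnic enclave"), ("NEIGHBORHOOD", "Kampong Glam")},
    {("CUISINE", "Malay cuisine"), ("NEIGHBORHOOD", "Kampong Glam")},
]
predicted = [
    {("NEIGHBORHOOD", "Kampong Glam")},
    {("CUISINE", "Malay cuisine"), ("NEIGHBORHOOD", "Kampong Glam")},
]
print(frame_accuracy(predicted, gold))
print(slot_prf(predicted, gold))
```

Frame-level accuracy only credits exact matches, which helps explain why it can be much lower than the slot-level F-measure for the same output.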
Results (32 entries from 9 teams)
             Schedule 1            Schedule 2
Team  Entry  Accuracy  F-measure   Accuracy  F-measure
0     0      0.0250    0.1124      0.0321    0.1462     ← Baseline 1
0     1      0.0161    0.1475      0.0222    0.1871     ← Baseline 2
1     0      0.0397    0.3115      0.0551    0.3565
1     1      0.0386    0.3032      0.0597    0.3540
1     2      0.0393    0.3071      0.0551    0.3563
1     3      0.0387    0.3052      0.0597    0.3580
1     4      0.0417    0.3166      0.0612    0.3675
2     0      0.0736    0.3966      0.0964    0.4430
2     1      0.0567    0.3764      0.0712    0.4267
2     2      0.0529    0.3756      0.0681    0.4259
2     3      0.0788    0.4047      0.0956    0.4519
2     4      0.0699    0.4024      0.0872    0.4499
3     0      0.0351    0.2060      0.0505    0.2539
3     1      0.0303    0.2424      0.0367    0.2830
3     2      0.0289    0.2074      0.0406    0.2573
3     3      0.0341    0.2442      0.0451    0.2895
4     0      0.0583    0.3280      0.0765    0.3658
4     1      0.0407    0.3405      0.0413    0.3572
4     2      0.0515    0.3708      0.0635    0.3945
4     3      0.0552    0.3649      0.0681    0.3913
4     4      0.0454    0.3572      0.0559    0.3758
5     0      0.0330    0.2749      0.0520    0.3314
5     1      0.0187    0.1804      0.0230    0.1967
5     2      0.0183    0.1520      0.0168    0.1371
5     3      0.0313    0.1574      0.0413    0.1880
5     4      0.0093    0.0945      0.0115    0.0977
6     0      0.0389    0.2849      0.0482    0.3230
6     1      0.0340    0.3070      0.0383    0.3532
6     2      0.0491    0.2988      0.0643    0.3381
7     0      0.0092    0.0783      0.0107    0.0794
7     1      0.0085    0.0767      0.0115    0.0809
8     0      0.0192    0.1570      0.0214    0.1554
8     1      0.0068    0.0554      0.0069    0.0577
9     0      0.0231    0.1114      0.0314    0.1449
Pilot Task: Spoken Language Understanding
Task Definition
Input: Transcribed utterance at each timestep
Output
Speech Act: 4 main categories with 21 attributes
Semantic Tags: 8 main categories with subcategories, relative modifiers and from-to modifiers
Example
Input: 我介绍你这个甘榜格南。 (I recommend you this Kampong Glam.)
Speech Act: INI (RECOMMEND)
Semantic Tags: 我介绍你这<LOC CAT="CULTURAL">个甘榜格南</LOC>。
(I recommend you this <LOC CAT="CULTURAL">Kampong Glam</LOC>.)
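The inline annotation in the example can be read back into structured form with a small parser. The sketch below is an assumption about how one might do this with regular expressions; it handles flat (non-nested) tags only, accepts both straight and curly quotes around attribute values, and the function names are hypothetical.

```python
import re

# Matches <TAG ATTR="VALUE" ...>text</TAG> for flat, non-nested tags.
TAG_PATTERN = re.compile(
    r'<(?P<tag>[A-Z_]+)(?P<attrs>[^>]*)>(?P<text>.*?)</(?P=tag)>')
# Attribute values may be wrapped in straight or curly double quotes.
ATTR_PATTERN = re.compile(r'([A-Z_]+)=["“”]([^"“”]*)["“”]')


def extract_semantic_tags(tagged):
    """Return one dict per tagged span: tag name, attributes, covered text."""
    results = []
    for match in TAG_PATTERN.finditer(tagged):
        attrs = dict(ATTR_PATTERN.findall(match.group("attrs")))
        results.append({"tag": match.group("tag"),
                        "attrs": attrs,
                        "text": match.group("text")})
    return results


def strip_tags(tagged):
    """Remove the tag markup and keep only the utterance text."""
    return re.sub(r'</?[A-Z_]+[^>]*>', '', tagged)


example = 'I recommend you this <LOC CAT="CULTURAL">Kampong Glam</LOC>.'
print(extract_semantic_tags(example))
print(strip_tags(example))
```

Nested tags and the relative or from-to modifiers mentioned in the task definition would need a proper parser rather than this regular-expression shortcut.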
Baselines: SVM for Speech Acts and CRF for Semantic Tags
Evaluation Metrics: Precision/Recall/F-measure
Results on Speech Acts (12 entries from 4 teams)
             Guide                     Tourist
Team  Entry  P       R       F         P       R       F
0     0      0.4588  0.2480  0.3219    0.3694  0.1828  0.2446    ← SVM baseline
2     0      0.5450  0.3911  0.4554    0.5001  0.5501  0.5239
2     1      0.5305  0.3969  0.4540    0.5331  0.5263  0.5297
2     2      0.5533  0.3829  0.4526    0.5107  0.5425  0.5261
2     3      0.5127  0.4251  0.4648    0.5605  0.4999  0.5285
3     0      0.4279  0.3583  0.3900    0.4591  0.4241  0.4409
3     1      0.4340  0.3635  0.3956    0.4498  0.4119  0.4300
5     0      0.4085  0.3364  0.3690    0.5026  0.4484  0.4739
5     1      0.3905  0.3216  0.3527    0.4519  0.4031  0.4261
5     2      0.4639  0.3820  0.4190    0.4916  0.4385  0.4635
5     3      0.4540  0.3739  0.4101    0.4871  0.4346  0.4594
5     4      0.4459  0.3672  0.4028    0.4984  0.4446  0.4700
7     0      0.5007  0.2976  0.3733    0.5079  0.4156  0.4571
Results on Semantic Tags (8 entries from 3 teams)
             Guide                     Tourist
Team  Entry  P       R       F         P       R       F
0     0      0.4666  0.3187  0.3787    0.5259  0.2659  0.3532    ← CRF baseline
3     0      0.4650  0.3182  0.3779    0.5331  0.2620  0.3513
3     1      0.4650  0.3182  0.3779    0.5331  0.2620  0.3513
5     0      0.5006  0.2923  0.3691    0.5083  0.3110  0.3859
5     1      0.5469  0.1893  0.2813    0.5121  0.3081  0.3847
5     2      0.3577  0.2476  0.2926    0.3031  0.2237  0.2574
5     3      0.3486  0.2541  0.2939    0.2932  0.2149  0.2480
5     4      0.3395  0.2111  0.2603    0.2947  0.2072  0.2433
7     0      0.4400  0.3207  0.3710    0.4408  0.2926  0.3517
Pilot Task: Spoken Language Generation
Task Definition
Input: Speech act and semantic tags at each time step
Output: Generated utterance
Example
Input: INI (RECOMMEND), <LOC CAT="CULTURAL">Kampong Glam</LOC>
Output: 我介绍你这个甘榜格南。 (I recommend you this Kampong Glam.)
Baseline
Example-based language generation
Using k-nearest neighbors algorithm on speech acts and semantic tags
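One way to picture an example-based generator of this kind is as retrieval over training examples keyed by speech act and semantic tags. The sketch below is only an assumption about the general shape: the set-based features, the Jaccard similarity, the function names, and the toy examples (including the tag label strings) are all hypothetical and are not taken from the actual baseline.

```python
def featurize(speech_act, semantic_tags):
    """Represent an input as a set of symbolic features."""
    return {("act", speech_act)} | {("tag", tag) for tag in semantic_tags}


def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def generate(speech_act, semantic_tags, examples, k=1):
    """Return the utterances of the k training examples closest to the input."""
    query = featurize(speech_act, semantic_tags)
    ranked = sorted(
        examples,
        key=lambda ex: jaccard(query, featurize(ex["act"], ex["tags"])),
        reverse=True,
    )
    return [ex["utterance"] for ex in ranked[:k]]


# Hypothetical toy training examples, for illustration only.
toy_examples = [
    {"act": "INI (RECOMMEND)", "tags": ["LOC:CULTURAL"],
     "utterance": "I recommend you this Kampong Glam."},
    {"act": "INI (RECOMMEND)", "tags": ["LOC:HOTEL"],
     "utterance": "Let me recommend to you, the V Hotel."},
]
print(generate("INI (RECOMMEND)", ["LOC:CULTURAL"], toy_examples))
```

This version returns the retrieved utterance verbatim; a fuller system would presumably substitute the tagged values from the input into the retrieved template.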
Evaluation Metrics
BLEU: Geometric average of n-gram precision of system outputs to references
AM-FM: Linear interpolation of cosine similarity and normalized n-gram probability
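Both metric descriptions reduce to short formulas. The sketch below computes the geometric mean of clipped n-gram precisions against a single reference and a linear interpolation of two component scores. It is a simplification under stated assumptions: full BLEU also applies a brevity penalty and supports multiple references, the AM and FM components are taken here as precomputed numbers, and the 0.5 interpolation weight is a placeholder rather than the value used in the challenge.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """n-gram precision with counts clipped by the reference counts."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def geometric_mean_precision(candidate, reference, max_n=4):
    """Geometric average of the 1..max_n clipped n-gram precisions.
    No brevity penalty is applied in this sketch."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)


def am_fm(am_score, fm_score, weight=0.5):
    """Linear interpolation of the AM score (cosine similarity) and the
    FM score (normalized n-gram probability). The weight is an assumption."""
    return weight * am_score + (1.0 - weight) * fm_score


reference = "I recommend you this Kampong Glam .".split()
candidate = "I recommend this Kampong Glam .".split()
print(geometric_mean_precision(candidate, reference, max_n=2))
print(am_fm(0.2, 0.4))
```

Because the geometric mean is zero whenever any n-gram order has no overlap, short outputs are often scored with smoothing in practice; none is applied here.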
Results (4 entries from 1 team)
             Guide             Tourist
Team  Entry  AM-FM   BLEU      AM-FM   BLEU
0     0      0.1981  0.3854    0.2602  0.5921    ← Baseline
5     0      0.2818  0.3264    0.3221  0.4850
5     1      0.3180  0.3371    0.3635  0.5249
5     2      0.2737  0.2852    0.3100  0.4741
5     3      0.2405  0.2758    0.4258  0.5302
* More details can be found in our paper in the SLT proceedings, on the DSTC5 official website (http://workshop.colips.org/dstc5/), and in the DSTC5 GitHub repository (https://github.com/seokhwankim/dstc5).
