At SpeechTEK 2009 in New York on August 24, 2009, Dr. Daniel C. Burnett, Director of Speech Technologies at Voxeo, spoke on optimizing speech recognizer rejection thresholds. Abstract:
This session will explain ASR (automatic speech recognizer) confidence rejection thresholds: what they are, where they come from, and why they are critical to your ASR-enabled IVR. We describe the steps necessary to optimize this important threshold value throughout your application, covering transcription, the importance of grammar coverage, and terms such as the Equal Error Rate. This session is ideal for those ready to take their ASR-enabled IVR tuning to the next level.
2. Why this talk?
• Sometimes we forget the basics, which are:
  • Recognizers are not perfect
  • They can be optimized in a straightforward manner
  • The simplest optimization is the rejection threshold
3. The Goal
• End user goal: optimal experience
• Our goal: determine the user experience for each possible rejection threshold, then choose the optimum threshold
• Must compare the true classification of an audio sample against the ASR engine's classification
4. True classifications
• Assume human-level recognition
• App should still distinguish (i.e., possibly behave differently) among the following cases:

Case                                                           Possible behavior
No speech in audio sample (nospeech)                           Mention that you didn't hear anything and ask for a repeat
Speech, but not intelligible (unintelligible)                  Ask for a repeat
Intelligible speech, but not in app grammar (out-of-grammar)   Encourage in-grammar speech
Intelligible speech, and within app grammar (in-grammar)       Respond to what the person said
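A minimal sketch of this per-case branching, assuming a simple prompt-per-case turn; the prompt wording and function names are illustrative, not from the talk:

```python
# Illustrative sketch: branching one IVR turn on the four true
# classifications above. Prompt wording is an example, not from the talk.

PROMPTS = {
    "nospeech":       "Sorry, I didn't hear anything. Could you say that again?",
    "unintelligible": "Sorry, I didn't catch that. Could you say that again?",
    "out-of-grammar": "Please say one of the menu options, like 'sales' or 'support'.",
}

def respond(true_class: str, utterance: str = "") -> str:
    """Return the prompt to play for a classified audio sample."""
    # In-grammar speech gets an application response; the other three
    # cases get the reprompts from the table above.
    if true_class == "in-grammar":
        return f"Okay, {utterance}. One moment please."
    return PROMPTS[true_class]
```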
6. Crossing these two . . .

True \ ASR        nospeech                        rejected              recognized
nospeech          Correct classification         Improperly rejected   Incorrect
unintelligible    Improperly treated as silence   Correct behavior      Assume incorrect
out-of-grammar    Improperly treated as silence   Correct behavior      Incorrect
in-grammar        Improperly treated as silence   Improperly rejected   Either correct or incorrect

7. Crossing these two . . . Misrecognitions
(The same table, highlighting the error cells in the "recognized" column.)

8. Crossing these two . . . "Misrejections"
(The same table, highlighting the error cells in the "rejected" column.)

9. Crossing these two . . . "Missilences"
(The same table, highlighting the error cells in the "nospeech" column.)
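This matrix is mechanical enough to write down as code. A minimal sketch, with cell labels taken from the table above (the function name is mine):

```python
# Sketch of the matrix above as a lookup: given a sample's true
# classification and what the ASR did with it, name the outcome.

OUTCOMES = {
    ("nospeech",       "nospeech"):   "correct classification",
    ("nospeech",       "rejected"):   "misrejection",    # improperly rejected
    ("nospeech",       "recognized"): "misrecognition",  # incorrect
    ("unintelligible", "nospeech"):   "missilence",      # improperly treated as silence
    ("unintelligible", "rejected"):   "correct behavior",
    ("unintelligible", "recognized"): "misrecognition",  # assume incorrect
    ("out-of-grammar", "nospeech"):   "missilence",
    ("out-of-grammar", "rejected"):   "correct behavior",
    ("out-of-grammar", "recognized"): "misrecognition",
    ("in-grammar",     "nospeech"):   "missilence",
    ("in-grammar",     "rejected"):   "misrejection",
}

def classify_outcome(true_class, asr_result, hypothesis_matches=False):
    # In-grammar speech that was recognized is correct only when the
    # hypothesis matches the transcription ("either correct or incorrect").
    if (true_class, asr_result) == ("in-grammar", "recognized"):
        return "correct" if hypothesis_matches else "misrecognition"
    return OUTCOMES[(true_class, asr_result)]
```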
10–11. Three types of errors
• Missilences -- called silence, but wasn't
• Misrejections -- rejected inappropriately
• Misrecognitions -- recognized inappropriately or incorrectly
So how do we evaluate these?
12. Evaluating errors
1. Evaluation data set
2. Try every rejection threshold value
3. Plot errors as a function of threshold
4. Select the optimal value for your app
13. 1. Evaluation data set(s)
• Data selection
  • Must be representative ("every nth call")
  • Ideally at least 100 recordings per grammar path for good confidence in results
• Transcription
  • Goal is to compare against recognition results, so no punctuation, coughs, etc. needed in the transcription itself (but good to have in separate comments)
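Assigning the true classification to each transcribed sample can then be mechanical. A sketch, assuming conventions not from the talk: an empty transcription means no speech, an "[unintelligible]" marker flags untranscribable speech, and a hypothetical in_grammar() callable tests grammar coverage:

```python
# Sketch (conventions are assumptions, not from the talk): derive the true
# classification of a sample from its transcription and a grammar test.

def true_class(transcription, in_grammar):
    """in_grammar is a hypothetical callable testing grammar coverage."""
    text = transcription.strip()
    if not text:
        return "nospeech"            # transcriber heard nothing
    if text == "[unintelligible]":
        return "unintelligible"      # transcriber couldn't make it out
    return "in-grammar" if in_grammar(text) else "out-of-grammar"
```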
14. 2. Try every rejection threshold value
• Run the recognizer in batch mode with a rejection threshold of 0 (i.e., no rejection). Remember to collect confidence scores!
• Then, for each threshold from 0 to 100, calculate the number of misrecognitions, misrejections, and missilences
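A sketch of this sweep, assuming each sample from the threshold-0 batch run is a record carrying the ASR's result, its confidence score (0–100), the true classification, and whether the hypothesis matched the transcription; the field names are mine:

```python
# Sketch of step 2 (record fields are assumptions): a result whose
# confidence falls below the threshold counts as rejected.

def count_errors(records, threshold):
    misrecognitions = misrejections = missilences = 0
    for r in records:
        if r["asr_result"] == "nospeech":
            # Endpointer heard nothing; the rejection threshold never
            # applies, so missilences are flat across the sweep.
            if r["true_class"] != "nospeech":
                missilences += 1
        elif r["confidence"] < threshold:
            # Rejected at this threshold: an error for nospeech and
            # in-grammar samples, correct behavior otherwise.
            if r["true_class"] in ("nospeech", "in-grammar"):
                misrejections += 1
        else:
            # Accepted (recognized): correct only for in-grammar speech
            # whose hypothesis matches the transcription.
            if r["true_class"] != "in-grammar" or not r["hypothesis_matches"]:
                misrecognitions += 1
    return misrecognitions, misrejections, missilences

# Tiny made-up sample; real records come from the threshold-0 batch run.
records = [
    {"asr_result": "recognized", "confidence": 82,
     "true_class": "in-grammar", "hypothesis_matches": True},
    {"asr_result": "recognized", "confidence": 34,
     "true_class": "out-of-grammar", "hypothesis_matches": False},
    {"asr_result": "nospeech", "confidence": 0,
     "true_class": "unintelligible", "hypothesis_matches": False},
]
curves = {t: count_errors(records, t) for t in range(101)}
```

Plotting the three counts in curves against the threshold gives the error curves of step 3.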
18–21. 4. Select optimal value
• Equal-error-rate: not necessarily the optimum
• Minimum of the sum: a good starting point, great for comparing across engines (on the same data set only!)
• Optimal: depends on your app; some errors may be more critical than others
• Question: if missilences are not affected by the threshold, why did I include them?
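The first two selection rules fall straight out of the error curves of the previous sketch, where curves[t] = (misrecognitions, misrejections, missilences). A minimal sketch:

```python
# Sketch: two starting points for choosing the threshold from the error
# curves computed in the previous sketch.

def equal_error_rate_threshold(curves):
    """Threshold where misrecognitions and misrejections are closest."""
    return min(curves, key=lambda t: abs(curves[t][0] - curves[t][1]))

def min_sum_threshold(curves):
    """Threshold with the fewest total errors; comparable across engines
    only when computed on the same data set."""
    return min(curves, key=lambda t: sum(curves[t]))
```

An app-specific choice would weight each error type by its cost to callers before taking the minimum.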
22. Further optimizations
• Move OOG into the IG category if semantically correct ("You bet" -> "yes"); see the sketch after this list
• Consider an additional threshold for confirmation
• Optimize endpointer parameters (affects missilences and/or "too much speech")
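A sketch of the first item, folding semantically equivalent out-of-grammar answers into the in-grammar category before scoring; the synonym list is an example, not from the talk:

```python
# Illustrative sketch: normalize semantically correct OOG answers to their
# IG equivalents before classifying them (synonyms here are examples).

SEMANTIC_EQUIVALENTS = {"you bet": "yes", "yeah": "yes", "nope": "no"}

def normalize(transcription):
    text = transcription.strip().lower()
    return SEMANTIC_EQUIVALENTS.get(text, text)
```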