A Study on the Video Scene Retrieving System with a Speech Recognizer
2013. 5. 14
Yoshika OSAWA
Kohno Lab.

A Study on the Video Scene Retrieving System

Recently, a wide variety of video data has been generated, stored, and accessed as computer technology and the Internet have advanced.
To find a video, or a scene within a video, quickly in this data, an efficient and effective technique is needed.
I therefore proposed a video scene retrieval system based on speech recognition using an HMM (Hidden Markov Model).
The proposed system was applied to scene retrieval experiments that evaluated the recognition rate for 457 short words.
The experimental results show an average detection accuracy of 68%.

Published in: Technology
  • Good afternoon, everyone. I'm Yoshika OSAWA, and I am very happy to see all of you today. Let's begin. The theme of my presentation is "A Study on the Video Scene Retrieving System with a Speech Recognizer", which I studied last year at Gunma National College of Technology.
  • A Study on the Video Scene Retrieving System

    1. A Study on the Video Scene Retrieving System with a Speech Recognizer
       2013. 5. 14
       Yoshika OSAWA
       Kohno Lab.
    2. Outline
       1. Introduction
       2. Aim of Study
       3. Composition of System
          i. Voice Divide Section
          ii. Speech Recognize Section
          iii. Scene Retrieve Section
       4. Evaluation Experiment
       5. Conclusion
    3. 1. Introduction
       • A variety of video data is being generated, stored, and accessed with advances in the Internet.
       • To search for a video scene quickly in this data, an efficient technique is needed.
    4. 1. Introduction
       • Multimedia Annotations
         o Nagao (2001)
    5. 1. Introduction
       • A Subtitling System for Broadcast Programs with a Speech Recognizer
         o Ando et al. (2001)
    6. 1. Introduction
       • Extract the voice from the video.
       • Advantages of voice:
         o Easy to convert into text.
         o Simple association.
       • Apply speech recognition to scene retrieval.
    8. 2. Aim of Study
       • Implement a scene retrieval system, then verify its accuracy and check its operation.
       • Make annotations automatically with speech recognition.
    10. 3. Composition of System
       Flowchart: Start → Select a Video → Voice Divide Section → Speech Recognize Section → Input a Keyword → Scene Retrieve Section → Output the result → End
    11. i. Voice Divide Section
       • Focus on the amplitude
         o Use the signal while it exceeds the amplitude threshold.
         o Reject segments that are too short, because they cannot be recognized.
         o Derive the thresholds by experiment.

       Axis        Threshold
       Amplitude   10 [%]
       Time        1000 [ms]
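The amplitude-based division on this slide can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: it assumes the 10% threshold is relative to the peak amplitude of the recording, that the 1000 ms value is the minimum length of a segment that is kept, and that loudness is judged per 10 ms frame. The function name `split_voiced` and all parameter names are hypothetical.

```python
def split_voiced(samples, sample_rate, amp_ratio=0.10, min_ms=1000, frame_ms=10):
    """Return (start, end) sample indices of segments louder than the threshold.

    Assumptions (not stated on the slide): the threshold is amp_ratio times
    the peak absolute amplitude, and loudness is checked per short frame.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return []  # empty or completely silent input
    threshold = amp_ratio * peak
    segments = []
    start = None
    n_frames = (len(samples) + frame_len - 1) // frame_len
    # One extra (empty) frame at the end closes a segment that is still open.
    for f in range(n_frames + 1):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        loud = bool(frame) and max(abs(s) for s in frame) > threshold
        if loud and start is None:
            start = f * frame_len
        elif not loud and start is not None:
            end = min(f * frame_len, len(samples))
            # Reject segments shorter than min_ms: too short to recognize.
            if (end - start) * 1000 >= min_ms * sample_rate:
                segments.append((start, end))
            start = None
    return segments
```

Each returned pair could then be passed to the recognizer as one utterance, with the start index kept as the seek position of the scene.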
    12. ii. Speech Recognize Section
    13. (1) Pre-Processing Unit
       • Digitization
         o Sampling frequency: 16 kHz
         o Quantization: 16 bit
       • Noise Reduction
         o Additive noise: subtract the component measured during silence.
         o Multiplicative noise: subtract on the log axis.
       Figure: microphone characteristics of the SM57
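The two noise-reduction bullets can be read as spectral subtraction (additive noise) and mean subtraction in the log spectrum (multiplicative distortion such as a microphone response). The sketch below follows that reading, which is an assumption; the slide does not name the exact methods. Inputs are plain lists of per-bin power values, and both function names are hypothetical.

```python
import math

def spectral_subtract(frame_power, noise_power, floor=1e-10):
    """Additive noise: subtract the noise power estimated from silence, bin by bin.

    The small positive floor keeps the result usable in a later log step.
    """
    return [max(p - n, floor) for p, n in zip(frame_power, noise_power)]

def log_mean_subtract(frames_power):
    """Multiplicative distortion: subtract the per-bin time average in the log domain.

    A fixed gain per bin becomes a constant offset after the log, so removing
    the mean over all frames cancels it.
    """
    logs = [[math.log(p) for p in frame] for frame in frames_power]
    n_bins = len(logs[0])
    means = [sum(frame[b] for frame in logs) / len(logs) for b in range(n_bins)]
    return [[frame[b] - means[b] for b in range(n_bins)] for frame in logs]
```

The second function relies on the fact that a multiplicative factor in the power spectrum turns into an additive term on the log axis, which is why the slide says to subtract there.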
    14. (2) Feature Extraction Unit
       • Resonant frequency is effective as a feature value.
    15. (2) Feature Extraction Unit
       • Resolution of human hearing
         o Higher sensitivity at lower frequencies
       • Filter that matches human hearing: Mel-frequency
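The slide does not give a formula for the mel scale. One widely used conversion is shown below as an illustration; the constants 2595 and 700 come from that common formula and are not taken from the slides. The helper `mel_center_frequencies` is hypothetical and shows how filters spaced evenly on the mel axis end up denser at low frequencies.

```python
import math

def hz_to_mel(f_hz):
    """Common mel-scale formula (assumed here; the slides do not state one)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_low, f_high):
    """Centre frequencies in Hz of n_filters filters equally spaced in mel."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]
```

With the 16 kHz sampling rate from the pre-processing slide, a natural upper limit for `f_high` would be 8000 Hz, the Nyquist frequency.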
    16. (2) Feature Extraction Unit
       • Inverse Fourier transform on the Mel-frequency axis
         o New axis: Cepstrum
         o Separates the voice pitch from the resonance frequency
       • MFCC (Mel Frequency Cepstrum Coefficient)
         o Information on vowels
       • ΔMFCC
         o Information on consonants
       • Feature vector
         o (Average power, MFCC, ΔMFCC)
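As a rough sketch of the last two steps: the code below turns log mel filter-bank energies into cepstral coefficients and takes frame-to-frame differences. Note that it uses a DCT-II in place of the inverse Fourier transform named on the slide, which is the usual practical substitute for real, symmetric log spectra, and that the delta is a simple two-point difference. Function names and the choice of difference window are assumptions.

```python
import math

def cepstrum(log_mel_energies, n_coeffs):
    """DCT-II of log mel filter-bank energies (stand-in for the inverse FT).

    Low-order coefficients describe the smooth spectral envelope (resonances);
    the fine structure caused by voice pitch lands in higher-order terms.
    """
    n = len(log_mel_energies)
    return [
        sum(e * math.cos(math.pi * k * (m + 0.5) / n)
            for m, e in enumerate(log_mel_energies))
        for k in range(n_coeffs)
    ]

def delta(frames):
    """Simple delta features: half the difference between neighbouring frames.

    Edge frames reuse themselves as the missing neighbour.
    """
    out = []
    last = len(frames) - 1
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out
```

A feature vector in the sense of the slide would then concatenate the average power of the frame, its cepstral coefficients, and the matching delta row.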
    17. (3) Identification Unit
       • From Bayes' theorem
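The formula from this slide is not in the transcript. For orientation, the standard Bayes decision rule used in speech recognition is usually written as below. The symbols are assumptions introduced here: X is the observed sequence of feature vectors and W is a candidate word (or character) sequence.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} P(X \mid W)\, P(W)
```

The denominator P(X) does not depend on W, so it can be dropped from the maximization. In this formulation P(X | W) is the acoustic model, which is where an HMM is typically used, and P(W) is the language model.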
    18. (3) Identification Unit
       • Speech waveform: observable
       • Character information: not directly observable
       • Estimate the character information from the waveform by using HMMs (Hidden Markov Models)
         o Maximum likelihood calculation: Viterbi algorithm
         o Machine learning: Baum-Welch algorithm
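To make the Viterbi step concrete, here is a minimal sketch for a discrete HMM with log probabilities stored in dictionaries. It is a textbook illustration under those assumptions, not the decoder used by the recognition engine; real recognizers work on continuous feature vectors and far larger state spaces. All argument names are hypothetical.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_path, log_probability) for a non-empty observation sequence.

    log_start[s], log_trans[p][s] and log_emit[s][o] hold log probabilities.
    """
    # Score of the best path ending in each state after the first observation.
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for reaching s at this time step.
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the stored pointers back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]
```

Working in log probabilities avoids numerical underflow on long observation sequences, since products of many small probabilities become sums.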
    19. iii. Scene Retrieve Section
       • Matching the keyword and the text
         1. Input a keyword.
         2. Match the keyword by string searching.
         3. Extract the scene in which the keyword was spoken.
         4. Output a thumbnail.
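Steps 1 to 3 amount to a substring search over the recognized text of each voice segment. The sketch below assumes each segment carries its start time in the video; the `Segment` type and `find_scenes` function are hypothetical names, and thumbnail generation (step 4) is left out.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start_sec: float  # where the voice segment begins in the video
    text: str         # text the recognizer produced for that segment

def find_scenes(segments, keyword):
    """Return the start time of every segment whose text contains the keyword."""
    if not keyword:
        return []  # an empty keyword would otherwise match every segment
    return [seg.start_sec for seg in segments if keyword in seg.text]
```

Each returned time is a seek destination: the player would jump there and show a thumbnail of that frame.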
    21. 4. Evaluation Experiment
       1. Compare the recognition result with the word I heard.
       2. Calculate the recognition rate.
       3. Evaluate it for each number of characters.

       Sample data
       Video    NHK news
       Time     3 minutes
       Number   30 videos
       Words    457 words
       Engine   Julius
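The three evaluation steps can be sketched as a small tally. This assumes a word counts as recognized only when the recognizer output equals the word that was heard, and that words are grouped by their length in characters; the function name and the input format are hypothetical.

```python
from collections import defaultdict

def recognition_rate_by_length(pairs):
    """pairs: iterable of (heard_word, recognized_word).

    Returns {number_of_characters: percentage recognized correctly}.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for heard, recognized in pairs:
        totals[len(heard)] += 1
        if heard == recognized:
            correct[len(heard)] += 1
    return {n: 100.0 * correct[n] / totals[n] for n in sorted(totals)}
```

An overall rate computed over all words weights each length by how many words have it, so it need not equal the plain average of the per-length rates.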
    22. 4. Evaluation Experiment
       • Total average rate is 68%.
       Chart: recognition rate (vertical axis, 0% to 80%) against length (horizontal axis, labeled 1 to 6 "words")
       1: 67%   2: 73%   3: 69%   4: 46%   5: 45%   6: 40%
    23. 4. Evaluation Experiment
       • Verify the correspondence between the keyword and the seek destination.
         o Select a thumbnail and play from that scene.
         o Check whether the keyword was spoken.
    24. 4. Evaluation Experiment
       • The recognition rate decreases as the number of characters increases.
       • The retrieved scene corresponds to the keyword.
       • Recognition errors occur in weak consonant parts.
         o The Voice Divide Section needs improvement.
         o The recognition accuracy must also be improved.
    26. 5. Conclusion
       • System for watching video efficiently
         o Uses speech recognition
         o Makes annotations automatically
       • Future work
         o Adopt the zero-crossing number in the Voice Divide Section.
         o Take in the latest speech recognition technology.
         o Incorporate image recognition.
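The zero-crossing number mentioned under future work counts how often the waveform changes sign within a frame. Weak, unvoiced consonants tend to have low amplitude but many zero crossings, so this count could complement the amplitude threshold. The sketch below is only an illustration of the measure; how it would be combined with the amplitude test is not specified in the slides, and the function name is hypothetical.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive non-zero samples in one frame."""
    count = 0
    prev = 0
    for s in frame:
        if s == 0:
            continue  # exact zeros carry no sign; skip them
        if prev != 0 and (s > 0) != (prev > 0):
            count += 1
        prev = s
    return count
```

A frame could, for example, be kept as speech when either its amplitude or its zero-crossing count exceeds an experimentally chosen threshold.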
    27. Thank you for your attention!
