Multi-Speaker Detection and Tracking
Using Audio and Video Sensors
with Gesture Analysis
By: Abhishek M K
Under the guidance of:
Manjunath Raikar
Asst. Prof.
Dept. of CSE
CONTENTS
• Introduction
• What is an E-Learning class?
• Working
• Block diagram
• Types of virtualization
• Conclusion
• References
INTRODUCTION
• E-learning uses the concept of video conferencing
for interaction between students and tutors in
different locations.
• The tutor’s actual presence is in a real classroom
and the students can view their tutor through a
video in a virtual classroom.
• Audio and video sensors are used to make the E-
learning classroom more efficient.
• Audio sensors such as microphones are used to
receive audio input, and video sensors such as
cameras are used to receive video signals.
• Gestures are used as a form of non-verbal
communication.
• Questions from multiple students who ask at the
same time can be handled by using gesture
analysis.
What is an E-Learning class?
• The main objective of our work is to make E-learning
classrooms as similar as possible to normal classrooms.
• Multi-speaker detection is enabled in the system, and
the tutor’s gestures are used to make decisions.
• Both the real and the virtual classrooms have cameras
as well as audio sensors.
CONTINUED…
• Students who have questions will either raise their
hand or talk.
• These audio and video sensors work collaboratively
to detect the first event in either the virtual or
the real classroom.
• The PTZ camera then zooms in on that particular
location, so the focus falls on a specific
student.
Working
• The speaker is identified by using a microphone array
and PTZ camera.
• The speaker who talks first is identified from either
the virtual or the real classroom using audio/video signals.
• The PTZ camera and the audio sensors are used to
track the students who want to speak.
• Students who gesture or speak are put in a queue,
with priority given to whoever gestured or spoke first.
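The first-come queue above can be sketched with a min-heap keyed on the event timestamp; the timestamps, room names, and student labels below are illustrative assumptions, not values from the system:

```python
import heapq

# Interrupt events as (timestamp_seconds, room, student, kind) tuples.
# A min-heap ordered on the first field gives first-come priority.
events = [
    (2.4, "virtual", "student-B", "hand-raise"),
    (1.1, "real", "student-A", "speech"),
    (3.0, "real", "student-C", "speech"),
]

queue = []
for event in events:
    heapq.heappush(queue, event)  # heap orders by timestamp

first = heapq.heappop(queue)  # earliest interrupt wins the camera focus
print(first)  # (1.1, 'real', 'student-A', 'speech')
```

Later events stay in the queue in timestamp order, so the camera can move to the next waiting student once the first question is answered.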
CONTINUED…
• The student who first gestures or speaks becomes
the focus of the camera.
• In the virtual classroom, the students need a
screen to view the professor.
• We need three cameras for taking pictures.
• The students are localized using audio and video
sensors.
Fig 1: The tutor takes the class. His video is displayed in the
remote classroom, and the remote students’ video is displayed in the
real classroom.
Fig 2: A student in the remote classroom raises his hand to ask a question.
His face is focused in the real classroom, as he produces the first interrupt.
Block diagram
• Real Classroom: audio sensor → human voice detector;
video sensor → hand-gesture detector.
• Virtual Classroom: audio sensor → human voice detector;
video sensor → hand-gesture detector.
• Both classrooms feed a common chain: Priority Detection
System → Localization → Tutor’s Gesture Analysis →
Video Sensor → Focus.
• The audio sensors sense the students who are asking
questions, and the video sensors capture images of
the students.
• The audio sensor output is fed to a human-voice
detection system, and the video sensor output is
used to detect students’ hand raises.
• A priority detection system then determines which
event happened first.
CONTINUED…
• After prioritization, the camera focuses on the
particular student who asked first.
• The real and remote classrooms are connected
via the internet.
TYPES OF VIRTUALIZATION
• Audio Virtualization
• Video Virtualization
Audio virtualization
• For audio localization we use the concept of estimating
the time delay between a pair of microphones.
• Cross-correlation between the audio signals is used to
obtain the time delay.
• Steps for audio localization:
 Obtain the audio signals
 Convert them to frames and calculate the average energy of each frame
 If the energy of a frame is above a threshold, it is speech
 Cross-correlate the speech signals to find the time delay
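The steps above can be sketched as follows; the sampling rate, frame length, energy threshold, and synthetic click signals are all illustrative assumptions, not parameters of the actual system:

```python
import numpy as np

fs = 8000  # assumed sampling rate in Hz

def to_frames(signal, frame_len=160):
    """Step 2: split a signal into non-overlapping frames."""
    n = len(signal) // frame_len
    return signal[: n * frame_len].reshape(n, frame_len)

def is_speech(frame, threshold=0.005):
    """Step 3: a frame whose average energy exceeds the threshold is speech."""
    return np.mean(frame ** 2) > threshold

def estimate_delay(x, y, fs):
    """Step 4: cross-correlate to find the delay of x relative to y, in
    seconds. A positive value means the sound reached microphone x later."""
    corr = np.correlate(x, y, mode="full")
    lag = int(np.argmax(corr)) - (len(y) - 1)  # lag in samples
    return lag / fs

# Synthetic example: the same click reaches mic B 20 samples after mic A.
mic_a = np.zeros(400); mic_a[100] = 1.0
mic_b = np.zeros(400); mic_b[120] = 1.0

speech_frames = [f for f in to_frames(mic_a) if is_speech(f)]
delay = estimate_delay(mic_b, mic_a, fs)  # 20 / 8000 = 0.0025 s
```

The sign and magnitude of the delay indicate which side of the microphone pair the speaker is on, which is what the localization step uses to steer the PTZ camera.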
Video virtualization
• The students’ hand-raise gestures, as well as the professor’s
gestures, need to be detected to make decisions in the E-class.
• The gesture analysis algorithm works by comparing
reference frames with the frame to be checked.
• To create the reference images, we need to train gestures of
different categories and save them in a database.
• The captured image is compared with each of the reference
frames.
• The reference frame with the maximum correlation is detected as
the match.
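A minimal sketch of this correlation-based matching, using tiny synthetic frames and a hypothetical two-gesture database (real systems would use full camera frames and many trained templates):

```python
import numpy as np

def correlation(a, b):
    """Normalized cross-correlation between two equal-size frames."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom else 0.0

# Hypothetical gesture database: label -> trained reference frame.
references = {
    "hand-raised": np.array([[0, 1], [0, 1]], dtype=float),
    "hand-down":   np.array([[1, 0], [1, 0]], dtype=float),
}

# Captured frame to be checked against each reference.
captured = np.array([[0.1, 0.9], [0.0, 1.0]])

# The label whose reference correlates best is the detected gesture.
best = max(references, key=lambda k: correlation(captured, references[k]))
print(best)  # hand-raised
```

Normalizing by the mean and energy makes the score insensitive to brightness and contrast, so only the spatial pattern of the gesture drives the match.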
Conclusion
• The main purpose of the project is to make the E-
Learning classroom more natural by effectively using
gesture analysis of the tutor.
• Building an E-learning classroom is a challenge, but it
will make the virtual classroom more similar to a real one.
References
• [1] B. Hariharan, A. Vadakkepatt, and S. Kumar, “Remote Student
Localization using Audio and Video Processing for Synchronous
Interactive E-Learning,” Amrita Centre for Wireless Networks and
Applications, Amrita Vishwa Vidyapeetham, Kerala, India.
• [2] S. Berman and H. Stern, “Sensors for Gesture Recognition
Systems,” IEEE.
• [3] D. Lo, R. A. Goubran, R. M. Dansereau, G. Thompson, and
D. Schulz, “Robust Joint Audio-Video Localization in Video
Conferencing Using Reliability Information,” IEEE.
THANK YOU…..
