This document describes a system for multi-speaker detection and tracking in an e-learning classroom, based on audio and video sensors and gesture analysis. Microphones and cameras in both the real classroom and the virtual classrooms detect when a student raises a hand or speaks. When several students interrupt, the system determines which of them interrupted first and focuses the pan-tilt-zoom (PTZ) camera on that student. Audio signals are used for voice detection, and video signals are used to detect hand gestures. The aim is to make virtual classrooms behave more like real ones by letting multiple students signal questions at the same time, with gesture analysis and camera focus handling the interruptions.
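The prioritization step (deciding which student interrupted first so the PTZ camera can be pointed at that student) can be sketched as follows. This is only an illustrative sketch, not the system's actual implementation: the `InterruptEvent` record, its fields, the shared session clock, and the function names are all assumptions introduced here, and no real camera API is called. Detections from the audio path (voice) and the video path (raised hand) are assumed to arrive already labeled with a student and a timestamp.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class InterruptEvent:
    """Hypothetical record of one detected interruption."""
    student_id: str   # assumed identifier for the student
    timestamp: float  # seconds since session start (assumed shared clock)
    kind: str         # "voice" (audio sensor) or "hand" (video sensor)
    room: str         # "real" or "virtual" classroom


def queue_by_first_interrupt(events: Iterable[InterruptEvent]) -> List[InterruptEvent]:
    """Keep each student's earliest event, then order students by that time.

    Exact timestamp ties keep the order in which students were first seen.
    """
    earliest: Dict[str, InterruptEvent] = {}
    for ev in events:
        seen = earliest.get(ev.student_id)
        if seen is None or ev.timestamp < seen.timestamp:
            earliest[ev.student_id] = ev
    return sorted(earliest.values(), key=lambda ev: ev.timestamp)


def camera_target(events: Iterable[InterruptEvent]) -> Optional[str]:
    """Return the student the PTZ camera should focus on, or None if idle."""
    queue = queue_by_first_interrupt(events)
    return queue[0].student_id if queue else None


# Made-up example data: student_b raises a hand before student_a does,
# and only speaks afterwards.
events = [
    InterruptEvent("student_b", 12.4, "voice", "virtual"),
    InterruptEvent("student_a", 11.9, "hand", "real"),
    InterruptEvent("student_b", 11.2, "hand", "virtual"),
]
print(camera_target(events))  # student_b
```

Merging voice and hand events into a single time-ordered queue is one possible design choice; it treats either signal as a valid interruption and lets the remaining students be served in order once the first question is handled.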