This document summarizes an iBrutus computer vision module that uses a Kinect camera to track people, detect faces, and localize sound sources, enabling an interactive avatar named Brutus to hold natural conversations. The module tracks up to six people using a decision-forest algorithm on depth data, detects faces with Viola-Jones classifiers to determine who is speaking, and uses the Kinect's microphone-array data to identify the angular segment the sound is coming from. This allows Brutus to direct its gaze at the current speaker and to focus on individual segments to aid speech recognition. Future work may include occlusion handling, face recognition, and gesture recognition.
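The summary does not specify how the microphone-array data is turned into a sound segment; the Kinect SDK reports a beam angle directly. As an illustration of the underlying idea only, the following sketch (all names hypothetical, a simplified two-microphone far-field model rather than the actual four-mic Kinect array) converts an inter-microphone time delay into a direction-of-arrival angle:

```python
import math

def doa_angle(tau_s, mic_spacing_m, speed_of_sound=343.0):
    """Estimate a direction-of-arrival angle (radians) from the
    time delay of arrival (TDOA) between two microphones.

    Far-field assumption: sin(theta) = c * tau / d, where c is the
    speed of sound, tau the delay, and d the microphone spacing.
    """
    s = speed_of_sound * tau_s / mic_spacing_m
    s = max(-1.0, min(1.0, s))  # clamp against measurement noise
    return math.asin(s)

def angle_to_segment(angle_rad, n_segments=8, fov_rad=math.pi):
    """Map an angle in [-fov/2, +fov/2] to one of n discrete
    segments, mimicking the 'highlight the segment' behaviour."""
    frac = (angle_rad + fov_rad / 2.0) / fov_rad
    return min(n_segments - 1, max(0, int(frac * n_segments)))

# Example: a 0.2 ms delay across a 10 cm baseline.
angle = doa_angle(2.0e-4, 0.10)
segment = angle_to_segment(angle)
```

Once a segment is selected, the avatar's gaze target and the audio focus for speech recognition would both be driven from that segment index.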