This document provides an overview of a research paper on using music gestures to separate the sound sources in videos of musical performances. The paper describes a pipeline that models body and finger movements from video with a context-aware graph network and then associates those movements with the corresponding audio signals through an audio-visual fusion module. Experiments on several music performance datasets show that the approach outperforms previous methods at separating the sounds both of different instruments and of duets of the same instrument. The work presents a new direction of exploiting structured body dynamics to guide sound separation, with the audio-visual fusion module proposed as the mechanism for combining the two modalities.
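To make the two-stage structure concrete, below is a minimal PyTorch sketch of the idea: a graph network encodes per-frame body keypoints into a motion embedding, which then conditions an audio network that predicts a spectrogram mask for the target instrument. All module names, dimensions, and the FiLM-style conditioning are illustrative assumptions for this overview, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PoseGraphEncoder(nn.Module):
    """Toy graph network over body/finger keypoints.

    Each frame provides K keypoints with 2-D coordinates; a fixed
    adjacency matrix would encode the skeleton (identity is used here
    as a placeholder). This is a plain GCN stack, not the paper's
    context-aware variant.
    """
    def __init__(self, num_keypoints=25, in_dim=2, hidden=64, out_dim=128):
        super().__init__()
        self.register_buffer("adj", torch.eye(num_keypoints))
        self.gc1 = nn.Linear(in_dim, hidden)
        self.gc2 = nn.Linear(hidden, out_dim)

    def forward(self, keypoints):
        # keypoints: (batch, frames, K, 2)
        x = torch.relu(self.adj @ self.gc1(keypoints))
        x = torch.relu(self.adj @ self.gc2(x))
        # Pool over keypoints and frames -> one motion embedding per clip.
        return x.mean(dim=(1, 2))  # (batch, out_dim)

class AudioVisualFusion(nn.Module):
    """Fuse the motion embedding with mixture-spectrogram features and
    predict a separation mask for the target source. FiLM-style
    per-channel conditioning stands in for the paper's fusion module."""
    def __init__(self, vis_dim=128):
        super().__init__()
        self.audio_net = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.film = nn.Linear(vis_dim, 64)  # per-channel scale and shift
        self.mask_head = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, mix_spec, motion_emb):
        # mix_spec: (batch, 1, freq, time); motion_emb: (batch, vis_dim)
        a = torch.relu(self.audio_net(mix_spec))
        scale, shift = self.film(motion_emb).chunk(2, dim=-1)
        a = a * scale[:, :, None, None] + shift[:, :, None, None]
        mask = torch.sigmoid(self.mask_head(a))
        return mask * mix_spec  # estimated spectrogram of the target source

# Usage with random data, shapes only for illustration.
pose = torch.randn(2, 16, 25, 2)   # 2 clips, 16 frames, 25 keypoints
mix = torch.rand(2, 1, 256, 100)   # mixture magnitude spectrograms
est = AudioVisualFusion()(mix, PoseGraphEncoder()(pose))  # (2, 1, 256, 100)
```

Because the conditioning signal comes from pose rather than pixel appearance, a sketch like this can in principle distinguish two players of the same instrument, which is the setting where the paper reports its largest gains.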