The document summarizes a research paper on multi-modal weakly supervised dense event captioning. The paper aims to show that audio signals alone can achieve good performance on dense event captioning, and it evaluates this claim on a dataset of 20k videos annotated with temporally localized sentences. The model extracts separate audio and video feature representations and fuses them using several strategies. Evaluation shows that fusing MFCC and SoundNet audio features with C3D video features via MUTAN fusion outperforms the unimodal baselines. Even so, the model still struggles to detect every event onset and to generate captions for all events in the weakly supervised setting.
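The summary names MUTAN fusion without describing it. As a rough illustration, here is a minimal PyTorch sketch of MUTAN-style bilinear fusion (Ben-Younes et al., 2017) applied to an audio vector and a video vector; the `MutanFusion` class name, the feature dimensions, and the rank are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class MutanFusion(nn.Module):
    """Minimal MUTAN-style bilinear fusion sketch (dimensions are illustrative)."""

    def __init__(self, audio_dim=128, video_dim=4096, hidden_dim=512,
                 out_dim=1024, rank=5):
        super().__init__()
        self.rank = rank
        # Project each modality into a shared hidden space.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        # Rank-constrained core tensor: R rank-1 slices per modality.
        self.audio_slices = nn.ModuleList(
            [nn.Linear(hidden_dim, out_dim) for _ in range(rank)])
        self.video_slices = nn.ModuleList(
            [nn.Linear(hidden_dim, out_dim) for _ in range(rank)])

    def forward(self, audio_feat, video_feat):
        a = torch.tanh(self.audio_proj(audio_feat))
        v = torch.tanh(self.video_proj(video_feat))
        # A sum of R elementwise products approximates the full
        # bilinear interaction between the two modalities.
        fused = sum(self.audio_slices[r](a) * self.video_slices[r](v)
                    for r in range(self.rank))
        return torch.tanh(fused)


# Example: fuse a pooled audio vector with a pooled C3D-style video vector.
audio = torch.randn(2, 128)   # assumed pooled audio features
video = torch.randn(2, 4096)  # assumed pooled C3D fc-layer features
fusion = MutanFusion()
print(fusion(audio, video).shape)  # torch.Size([2, 1024])
```

The rank-constrained decomposition is the point of MUTAN: a full bilinear interaction between two feature vectors would require an enormous weight tensor, and the sum of a few rank-1 slices keeps the expressive cross-modal products while staying tractable.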