The main goal of this thesis is to develop a video retrieval system based on audio content, using deep learning techniques. The method developed in the thesis adapts the state-of-the-art method ViSiL to audio content. ViSiL is a video similarity learning architecture that captures the spatio-temporal relations between videos. The proposed method is called ViSiLaudio. To extract representative video descriptors, transfer learning is employed from a convolutional neural network trained on a large-scale dataset of audio events. Comparing the descriptors of two videos produces a similarity matrix that contains the similarity score between each time frame of one video and each time frame of the other. This matrix is then fed to a convolutional neural network in order to capture the temporal structures it contains. The output of this network is summarized with Chamfer Similarity into a final similarity score between the compared videos. The proposed network is trained with the triplet loss function, which increases the similarity score between two relevant videos and decreases it between irrelevant ones. To test the performance of ViSiLaudio on the problem of video retrieval based on audio content, the audio relations between videos in the FIVR-200K dataset were annotated. In addition, two state-of-the-art methods were re-implemented for comparison. On the resulting new dataset, ViSiLaudio outperforms these two methods by 14% and 34%, respectively. The proposed method was also evaluated on three visual-based video retrieval datasets. On two of the three datasets, ViSiLaudio outperforms the competition, while on the third, one of the compared methods marginally outperforms ViSiLaudio.
Finally, the hypothesis that audio methods in combination with visual ones can enhance the results is investigated. The combination does improve the results, but only marginally.
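
As an illustration of the scoring steps described above, the following minimal sketch computes a frame-to-frame similarity matrix, summarizes it with Chamfer Similarity, and evaluates a triplet loss on the resulting video-level scores. It is not the thesis implementation. Several points are assumptions made only for this example: cosine similarity is used as the frame-level score, the margin value is arbitrary, the triplet loss is written in a common similarity-based form, and the convolutional network that refines the similarity matrix is omitted. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two descriptor vectors (assumed frame-level score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(x, y):
    """Entry [i][j] is the similarity of time frame i of video x and time frame j of video y."""
    return [[cosine(u, v) for v in y] for u in x]


def chamfer_similarity(sim):
    """For every frame of the first video take its best match in the second, then average."""
    return sum(max(row) for row in sim) / len(sim)


def triplet_loss(sim_pos, sim_neg, margin=0.5):
    """Zero once the relevant pair scores at least `margin` above the irrelevant pair."""
    return max(0.0, sim_neg - sim_pos + margin)


# Toy descriptors: each video is a list of per-frame vectors (hypothetical values).
anchor = [[1.0, 0.0], [0.0, 1.0]]
positive = [[1.0, 0.0], [0.0, 1.0]]
negative = [[0.0, 1.0], [0.0, 1.0]]

score_pos = chamfer_similarity(similarity_matrix(anchor, positive))
score_neg = chamfer_similarity(similarity_matrix(anchor, negative))
loss = triplet_loss(score_pos, score_neg)
```

In this toy case the relevant pair receives a higher score than the irrelevant pair, so the loss vanishes for the chosen margin. Swapping the two scores yields a positive loss, which is the signal that training would use to push the scores apart.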