This paper proposes FL-MSRE, a few-shot learning approach to multimodal social relation extraction that combines textual and facial information. The authors construct a dataset from TV shows and novels in which each instance pairs a piece of text mentioning two individuals with their face images and an annotated social relation. By leveraging both text and face embeddings, FL-MSRE outperforms extracting social relations from text alone. It also achieves comparable performance whether the two face images come from the same photo or from different photos, demonstrating its stability.
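The fusion idea can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: it assumes precomputed text and face embeddings, concatenates them into one multimodal vector, and classifies a query against per-relation class prototypes (a prototypical-network-style few-shot classifier); all function names and dimensions here are hypothetical.

```python
import numpy as np

def fuse(text_emb, face_emb_a, face_emb_b):
    """Concatenate a text embedding with the two face embeddings
    (one per mentioned person) into a single multimodal vector."""
    return np.concatenate([text_emb, face_emb_a, face_emb_b])

def prototypes(support, labels):
    """Few-shot class prototypes: the mean fused embedding of the
    support examples for each relation label."""
    return {r: support[labels == r].mean(axis=0) for r in np.unique(labels)}

def classify(query, protos):
    """Assign the query to the relation whose prototype is nearest
    in Euclidean distance."""
    return min(protos, key=lambda r: np.linalg.norm(query - protos[r]))

# Toy usage with random stand-in embeddings (4-dim text, 3-dim faces).
rng = np.random.default_rng(0)
q = fuse(rng.normal(size=4), rng.normal(size=3), rng.normal(size=3))
```

In a few-shot setting like this, only a handful of labeled support examples per relation are needed to form the prototypes, which is why the approach suits rare social relations.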