The document presents a study evaluating the quality of word representations learned by multimodal pre-trained transformers. It reviews prior work showing that human concepts are grounded in sensory experience and that multimodal representations hold an advantage over text-only ones, particularly for concrete words. The study intrinsically evaluates how well semantic representations from models such as LXMERT and ViLBERT align with human intuitions by comparing model-derived word similarities against human similarity judgments. To measure semantic similarity independently of any particular context, it derives static embeddings from the models' contextualized representations.
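A minimal sketch of this evaluation pipeline may help: a "static" embedding for each word is obtained by averaging the contextual vectors a model produces for that word across several carrier sentences, and word-pair cosine similarities are then correlated with human ratings. Everything here is illustrative rather than the study's actual setup: BERT stands in for a multimodal encoder such as LXMERT or ViLBERT (whose forward pass would additionally require visual features), and the word pairs, ratings, and carrier sentences are toy placeholders, not a real benchmark.

```python
# Sketch, not the paper's pipeline: pool contextual vectors into static
# embeddings, then score word-pair similarity against human judgments.
import torch
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

# BERT is a stand-in; a multimodal model would also need image features.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def static_embedding(word: str, contexts: list[str]) -> torch.Tensor:
    """Average the word's contextual vectors over the given sentences."""
    vectors = []
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    for sentence in contexts:
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
        # Locate the subword span belonging to `word` and mean-pool it.
        ids = enc["input_ids"][0].tolist()
        for i in range(len(ids) - len(word_ids) + 1):
            if ids[i : i + len(word_ids)] == word_ids:
                vectors.append(hidden[i : i + len(word_ids)].mean(dim=0))
    return torch.stack(vectors).mean(dim=0)

# Toy human-rated pairs and carrier sentences (illustrative values only).
pairs = [("dog", "cat", 7.5), ("dog", "car", 1.2), ("banana", "apple", 6.8)]
contexts = {
    "dog": ["The dog barked loudly.", "She walked her dog."],
    "cat": ["The cat slept on the sofa.", "A cat chased the mouse."],
    "car": ["He parked the car outside.", "The car would not start."],
    "banana": ["She peeled a banana.", "The banana was ripe."],
    "apple": ["He bit into an apple.", "An apple fell from the tree."],
}

model_scores, human_scores = [], []
for w1, w2, rating in pairs:
    e1 = static_embedding(w1, contexts[w1])
    e2 = static_embedding(w2, contexts[w2])
    model_scores.append(torch.cosine_similarity(e1, e2, dim=0).item())
    human_scores.append(rating)

# Intrinsic evaluation: rank agreement between model and human similarities.
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.3f}")
```

Averaging over multiple contexts is one common way to collapse contextualized representations into static ones; the study's exact pooling choice and evaluation datasets may differ.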