Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching

Conference on Neural Information Processing Systems (NeurIPS 2022)

Byoungjip Kim, Sungik Choi, Dasol Hwang, Moontae Lee, Honglak Lee

Abstract

Despite their surprising zero-shot transfer performance, pre-training large-scale multimodal models is often prohibitive, as it requires huge amounts of data and computing resources. In this paper, we propose a method (BeamCLIP) that can effectively transfer the representations of a large pre-trained multimodal model (CLIP ViT-B/16) into a small target model (ResNet-50). For unsupervised transfer, we introduce cross-modal similarity matching (CSM), which enables a student model to learn the representations of a teacher model by matching the relative similarity distribution of each image across a set of text prompt embeddings. To better encode the text prompts, we design context-based prompt augmentation (CPA), which alleviates the lexical ambiguity of input text prompts. Our experiments show that unsupervised representation transfer from a large pre-trained multimodal model achieves strong ImageNet linear probe accuracy (74.8%), outperforming existing self-supervised learning methods (SimCLR: 65.6%) and closing the gap with supervised learning (76.2%).
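
To make the CSM idea concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: it assumes pre-computed CLIP text prompt embeddings and hypothetical tensors `teacher_img_embeds`, `student_img_embeds`, and `text_embeds`, and it matches the student's similarity distribution over the prompts to the (frozen) teacher's via a KL divergence. The temperature value is illustrative.

```python
# Minimal sketch of cross-modal similarity matching (CSM); tensor names and
# the temperature are assumptions for illustration, not the paper's code.
import torch
import torch.nn.functional as F

def csm_loss(teacher_img_embeds, student_img_embeds, text_embeds, tau=0.04):
    """KL divergence between the teacher's and the student's relative
    similarity distributions over the text prompt embeddings.

    teacher_img_embeds: [B, D] image embeddings from the frozen CLIP teacher
    student_img_embeds: [B, D] image embeddings from the student (projected to D)
    text_embeds:        [K, D] CLIP text prompt embeddings
    """
    # L2-normalize so dot products become cosine similarities.
    t = F.normalize(teacher_img_embeds, dim=-1)
    s = F.normalize(student_img_embeds, dim=-1)
    p = F.normalize(text_embeds, dim=-1)

    # Relative similarity distributions across the K text prompts.
    teacher_probs = F.softmax(t @ p.T / tau, dim=-1)          # [B, K]
    student_log_probs = F.log_softmax(s @ p.T / tau, dim=-1)  # [B, K]

    # The student matches the teacher's distribution; the teacher is not updated.
    return F.kl_div(student_log_probs, teacher_probs.detach(),
                    reduction="batchmean")
```

In this sketch the text prompt embeddings act as shared anchors: the student never needs labels, only agreement with the teacher on how each image relates to the prompts.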