SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

[Figure: model summary]


Parallel speech-text data is expensive and hard to collect compared with paired speech-image and text-image data. To bridge speech and text, we leverage the large-scale pre-trained image-language model CLIP and the speech self-supervised model HuBERT. With several model architecture designs, we achieve state-of-the-art performance on image-speech retrieval, and we show that SpeechCLIP can perform zero-shot speech-text retrieval and discover keywords from speech utterances.

Links: arXiv | code

Model Structure

In this work, we propose two architectures for integrating HuBERT and CLIP.

[Figure: model details]
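At a high level, both architectures pull a pooled speech embedding toward the matching CLIP image embedding with a CLIP-style symmetric contrastive loss. The following is a minimal NumPy sketch of that objective only, not our actual implementation: mean pooling and a random projection stand in for the learnable pooling on top of frozen HuBERT features, and random tensors stand in for frozen CLIP image embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(speech_emb, image_emb, temperature=0.07):
    """Symmetric contrastive (CLIP-style) loss between paired speech and
    image embeddings; row i of each batch is a matching pair."""
    s = l2_normalize(speech_emb)
    v = l2_normalize(image_emb)
    logits = s @ v.T / temperature          # (batch, batch) cosine similarities
    idx = np.arange(logits.shape[0])
    loss_s2i = -log_softmax(logits, axis=1)[idx, idx].mean()  # speech -> image
    loss_i2s = -log_softmax(logits, axis=0)[idx, idx].mean()  # image -> speech
    return (loss_s2i + loss_i2s) / 2

# toy frame features standing in for frozen HuBERT output: (batch, frames, dim)
frames = rng.normal(size=(4, 50, 768))
speech_emb = frames.mean(axis=1) @ rng.normal(size=(768, 512)) * 0.02  # pool + project
image_emb = rng.normal(size=(4, 512))       # stand-in for frozen CLIP image embeddings
print(clip_style_loss(speech_emb, image_emb))
```

Only the pooling and projection parameters are trained; HuBERT and CLIP stay frozen.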

Vector Quantization
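The cascaded variant quantizes intermediate speech representations onto a codebook so the model is forced through a discrete, token-like bottleneck. As a minimal sketch of the quantization step only (the trainable model uses a differentiable relaxation rather than this hard argmin, and the codebook here is random rather than CLIP's token embedding table):

```python
import numpy as np

def vector_quantize(features, codebook):
    """Snap each feature vector to its nearest codebook entry (squared L2).
    features: (seq, dim); codebook: (vocab, dim).
    Returns chosen indices and the quantized vectors."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (seq, vocab)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 8))   # stand-in for a token embedding table
# features sitting near codebook entries 3, 3, 17, 5, plus small noise
feats = codebook[[3, 3, 17, 5]] + 0.01 * rng.normal(size=(4, 8))
idx, quantized = vector_quantize(feats, codebook)
print(idx)  # → [ 3  3 17  5]
```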



Image-Speech Retrieval
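Retrieval is evaluated by ranking gallery items by cosine similarity to each query and measuring recall@k. A small sketch with synthetic embeddings (in the real evaluation, the query/gallery vectors are the model's speech and CLIP image embeddings):

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k=1):
    """Recall@k when query i's ground-truth match is gallery item i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                                  # (n_query, n_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
images = rng.normal(size=(10, 16))                  # toy image embeddings
speech = images + 0.05 * rng.normal(size=(10, 16))  # paired, slightly noisy
print(recall_at_k(speech, images, k=1))  # → 1.0
```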


Zero-Shot Speech-Text Retrieval
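Because speech embeddings are aligned to CLIP's image embedding space, and CLIP's text encoder maps text into that same space, speech can be compared against text directly even though no paired speech-text data was used in training. A toy sketch of the ranking step (random vectors stand in for real CLIP text embeddings):

```python
import numpy as np

def rank_texts(speech_emb, text_embs):
    """Rank candidate text embeddings by cosine similarity to one speech
    embedding; all vectors are assumed to live in the shared CLIP space."""
    s = speech_emb / np.linalg.norm(speech_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return np.argsort(-(t @ s))          # best-matching caption first

rng = np.random.default_rng(1)
texts = rng.normal(size=(5, 16))                   # stand-in CLIP text embeddings
speech = texts[2] + 0.05 * rng.normal(size=16)     # utterance matching caption 2
print(rank_texts(speech, texts)[0])  # → 2
```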


Keyword Discovery
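Discovered keyword embeddings can be interpreted by looking up their nearest neighbors in a vocabulary embedding table. A minimal sketch with a tiny made-up vocabulary (in practice the table is CLIP's subword token embeddings):

```python
import numpy as np

def nearest_tokens(keyword_emb, vocab_embs, vocab, k=3):
    """Return the k vocabulary tokens whose embeddings are closest (by
    cosine similarity) to a discovered keyword embedding."""
    v = vocab_embs / np.linalg.norm(vocab_embs, axis=1, keepdims=True)
    q = keyword_emb / np.linalg.norm(keyword_emb)
    order = np.argsort(-(v @ q))[:k]
    return [vocab[i] for i in order]

rng = np.random.default_rng(0)
vocab = ["dog", "cat", "beach", "ball", "sky"]       # hypothetical vocabulary
vocab_embs = rng.normal(size=(5, 8))
keyword = vocab_embs[2] + 0.1 * rng.normal(size=8)   # keyword near "beach"
print(nearest_tokens(keyword, vocab_embs, vocab))    # "beach" ranked first
```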

[Figure: keyword discovery]

[Figure: keyword discovery 1]

Cite our work!

@article{speechclip2022,
  title={SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model},
  author={Yi-Jen Shih and Hsuan-Fu Wang and Heng-Jui Chang and Layne Berry and Hung-yi Lee and David Harwath},
  journal={IEEE SLT},
  year={2022},
  publisher={IEEE}
}
© 2022, Ian Shih