CLIP - Contrastive Language-Image Pre-training


CLIP is based on the contrastive learning paradigm, often used in a self-supervised setting: the model learns from the similarity and dissimilarity of instances in the input data, shaping a representation space in which similar instances are clustered together and dissimilar instances are pushed apart.

How does one learn using contrastive learning?

In contrastive learning on an image dataset, for example, the unlabelled images are encoded and projected into an embedding space, and the learning objective is to compare the similarity of each sample against the other samples. The loss function essentially learns from a distance between the embeddings (classically the Euclidean distance), pulling similar samples together and pushing dissimilar samples apart.
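As a rough illustration, here is a minimal sketch of the classic margin-based contrastive loss on pairs of embeddings (a PyTorch sketch; the function name, margin value, and batch layout are illustrative assumptions, not taken from the CLIP paper):

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(z1, z2, is_similar, margin=1.0):
    """Margin-based contrastive loss over pairs of embeddings.

    z1, z2     : (batch, dim) embeddings produced by the encoder
    is_similar : (batch,) tensor, 1.0 for similar pairs, 0.0 for dissimilar pairs
    margin     : dissimilar pairs are only penalised while closer than this
    """
    dist = F.pairwise_distance(z1, z2)                        # Euclidean distance per pair
    pull = is_similar * dist.pow(2)                           # pull similar pairs together
    push = (1.0 - is_similar) * F.relu(margin - dist).pow(2)  # push dissimilar pairs apart
    return 0.5 * (pull + push).mean()
```

Minimising this loss shrinks the distance between similar pairs and enforces at least `margin` of separation between dissimilar ones.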

CLIP applies this idea to image-text data: in the pre-training stage it learns from a large, varied dataset of image-text pairs to capture the relationship between the encoded text and images, so that it can later perform vision tasks such as image classification in a zero-shot setting.
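A sketch of that pre-training objective, loosely following the symmetric contrastive-loss pseudocode in the CLIP paper (tensor names and the fixed temperature here are simplifications; CLIP actually learns the temperature as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs.

    image_emb, text_emb : (batch, dim) outputs of the image and text encoders;
                          row i of each tensor comes from the same image-text pair.
    """
    # L2-normalise so the dot product becomes cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

At inference time, zero-shot classification reuses the same similarity: each class name is written as a text prompt (for example "a photo of a cat"), embedded by the text encoder, and the class whose prompt is most similar to the image embedding wins. For example, with the Hugging Face checkpoint linked in the resources below (the image path and prompts are illustrative):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image; path is illustrative
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # one probability per prompt
print(dict(zip(prompts, probs[0].tolist())))
```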


Advantages of CLIP

Key Takeaways:

VirTex (an image-to-caption model) gave lower accuracy in zero-shot transfer when trained on the 400M image-text pairs, compared with CLIP's contrastive objective.


Limitations


Resources

https://openai.com/index/clip/

https://huggingface.co/openai/clip-vit-base-patch32

https://encord.com/blog/guide-to-contrastive-learning/

Active Research

https://arxiv.org/abs/2103.00020