CLIP - Contrastive Language-Image Pre-training


CLIP is based on the contrastive learning paradigm, often used in a self-supervised setting: the model learns from the similarity and dissimilarity of instances in the input data, shaping a representation space in which similar instances are clustered together and dissimilar instances are pushed apart.

How does one learn using contrastive learning?

In contrastive learning on an image dataset, for example, the unlabelled images are encoded and projected into an embedding space, and the learning objective is to compare the similarity of each sample against the other samples. The loss function essentially learns from a distance between the embeddings (classically the Euclidean distance), pulling similar samples together and pushing dissimilar samples apart.
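As a rough illustration, here is a minimal sketch of the classic margin-based contrastive loss on pairs of embeddings (a PyTorch sketch; the function name, margin value, and batch layout are illustrative assumptions, not taken from the CLIP paper):

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(z1, z2, is_similar, margin=1.0):
    """Margin-based contrastive loss over pairs of embeddings.

    z1, z2     : (batch, dim) embeddings produced by the encoder
    is_similar : (batch,) tensor, 1.0 for similar pairs, 0.0 for dissimilar pairs
    margin     : dissimilar pairs are only penalised while closer than this
    """
    dist = F.pairwise_distance(z1, z2)                        # Euclidean distance per pair
    pull = is_similar * dist.pow(2)                           # pull similar pairs together
    push = (1.0 - is_similar) * F.relu(margin - dist).pow(2)  # push dissimilar pairs apart
    return 0.5 * (pull + push).mean()
```

Minimising this loss shrinks the distance between similar pairs and enforces at least `margin` of separation between dissimilar ones.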

CLIP applies this idea to image-text data: in the pre-training stage it learns from a large, varied dataset of image-text pairs to capture the relationship between the encoded text and images, so that it can later perform vision tasks such as image classification in a zero-shot setting.
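A sketch of that pre-training objective, loosely following the symmetric contrastive-loss pseudocode in the CLIP paper (tensor names and the fixed temperature here are simplifications; CLIP actually learns the temperature as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs.

    image_emb, text_emb : (batch, dim) outputs of the image and text encoders;
                          row i of each tensor comes from the same image-text pair.
    """
    # L2-normalise so the dot product becomes cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

At inference time, zero-shot classification reuses the same similarity: each class name is written as a text prompt (for example "a photo of a cat"), embedded by the text encoder, and the class whose prompt is most similar to the image embedding wins. For example, with the Hugging Face checkpoint linked in the resources below (the image path and prompts are illustrative):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image; path is illustrative
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # one probability per prompt
print(dict(zip(prompts, probs[0].tolist())))
```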


Advantages of CLIP

Key Takeaways:

VirTex (an image-to-caption model) gave lower accuracy in zero-shot transfer when trained on the 400M image-text pairs, compared with CLIP's contrastive objective.


Limitations


Resources

https://openai.com/index/clip/

https://huggingface.co/openai/clip-vit-base-patch32

https://encord.com/blog/guide-to-contrastive-learning/

Active Research

https://arxiv.org/abs/2103.00020