Future of Computer Vision — Open AI Contrastive Language Image Pre-Training (CLIP)

A transformed-based neural network that uses Contrastive Language–Image Pre-training to classify images

6 min readJun 19, 2021

Couple of weeks ago Open AI has launched two computer vision models that can be the future of vision based technologies — DALL-E and CLIP. DALL-E has the ability to generate text from images, while CLIP classifies a very wide range of images by turning image classification into a text similarity problem.

What is Open AI aiming to solve with CLIP ?

Although in the coming years we have seen a plethora of improvements in the field of computer vision, they all come with some drawbacks -

Present day computer vision models like ResNet, InceptionNet and many others are fully capable of achieving human-level performance on complex image classification datasets but at the cost of availability of large datasets, which is not so easy to create.
Many models currently although perform very well on datasets like ImageNet they begin to under perform exponentially when exposed to variants in the dataset or on “out-of-the-box” data, as they are only optimised for performing on the benchmark and fail to perform in real-life scenarios.

OpenAI aims to solve these problems of large datasets and bad real-life performances with CLIP. CLIP has not only proven to give a state of the art of results on image classification but also other vision tasks such as object classification, action recognition in videos, and OCR. This shows that a single algorithm like CLIP can work with a variety of tasks and datasets without the need to build huge datasets but is computationally expensive.

Principle of Contrastive Learning

Contrastive learning is an approach to formulate the task of finding similar and dissimilar things for an ML model. Using this approach, one can train a machine learning model to classify between similar and dissimilar images.

Contrastive learning has seen a boom of interest in self-supervised learning techniques especially in computer vision with papers like Simclr and Moco.

CLIP — Algorithm Overview

The team at OpenAI have incorporated a lot of state of the art working methods such as zero-shot transfer, natural language supervision, and multimodal learning. Let’s begin with a high-level overview of the working of CLIPs

It starts off with a batch of text and image pairs that can be easily found on the internet. These texts and images are passed into a text and image encoder respectively and a similarity search is done where the images are mapped to the corresponding text from the entire batch. This alignment of image and text is the contrastive pre-training approach. A similar approach has been implemented in the ConVIRT paper in the field of medical imaging.

Once the images and texts have been matched, Zero-shot prediction can be performed. What happens here is that all the classes in the dataset are arranged in a specific format like “a photo of a {classname}’ and that is fed into the encoder. Similar to contrastive pre-training, the image is passed to the encoder and it performs a similarity search to determine which text matches the image from the entire batch, i.e, the text encoder will contain a batch of ‘a photo of a {dog}’, ‘a photo of a {car}’ etc and CLIP will estimate the best pairs with a given image. For example, we can see that the class guacamole ranked 1 out of 101 classes and television ranked 1 out of 397.

We have already seen that zero-shot approaches such as these (GPT-3) are computationally expensive. CLIP utilized a few different approaches to tackle this. The first one we have already seen is the contrastive pre-training approach, which led to significantly less computation. Secondly, it utilizes a vision transformer which further increased the efficiency over standard vision models like the ResNet.

This Zero-shot learning approach coupled with natural language supervision is what differentiates CLIP from the other vision models. By training a wide variety of data easily accessible on the internet and no direct optimising for benchmark, CLIP is much more generalised and representative.

We can see in the above image that the CLIP achieved the language model accuracy at just 33M parameters compared to 400M. CLIP is 12 times more efficient!!
As a result of this methodology, CLIP can easily be applied to nearly any visual classification tasks and achieve great performance.

Now you can see that the team at OpenAI has solved a lot of the problems of the current vision models. CLIP has reduced the labor-intensive large datasets that are required for SOTA computer vision tasks by learning from the text–image pairs that are already publicly available and not only that, it has also reduced the need to focus on a limited number of visual concepts.

Did you know the ImageNet dataset required 25,000 workers to annotate 14 million images for 22,000 object categories? That’s a lot of work!

Imagine using a pre-trained ImageNet model on a specific dataset of your choice. It would require to build a dataset from scratch and fine-tune your model. But all CLIP requires is for you to pass the names of your task’s visual concepts into the text encoder, and it will output a linear classifier of the visual representations.

One thing to note is that the CLIP can match the performance of SOTA vision models on datasets like ImageNet. But what openAI has also tested is that adding the linear classifier on top of the CLIP’s features boosts its accuracy by 10% however fails to generalise well on other variations of ImageNet.

CLIP — The best in the business ?

There are many limitations of CLIPs currently. Significant work is still needed to improve the task learning and transfer capabilities of CLIP. While scaling has so far steadily improved performance a large amount of computing is required for zero-shot CLIP to reach overall state-of-the-art performance. This is infeasible to train with current hardware.

When compared to task-specific models, CLIP has shown poor performance on several types of classification problems such as differentiating models of cars, species of flowers, etc. Another limitation is that CLIP still generalizes poorly to data that is truly out-of-distribution for it. Taking the example of MNIST, a simple baseline of logistic regression outperforms zero-shot CLIP. CLIP hopes that by training on a large and varied dataset that all data will be effectively in-distribution but as MNIST demonstrates, it is easy to violate.

CLIP as we have seen can flexibly generate zero-shot classifiers for a wide variety of datasets but is still limited to only those concepts in a given zero-shot classifier. We have seen an approach like Image Captioning that can generate novel outputs in comparison to ZS CLIP.

Conclusion

CLIP has introduced a really interesting and flexible approach to tackling computer vision problems. Not only can it overcome problems faced by many of today’s vision models and approaches but it does so with flying colors. Its ability to tackle almost any vision problem and still produce amazing results is not a small feat. OpenAI has released both the research paper and code.