
OpenAI's CLIP Paper Quietly Became the Backbone of Modern AI

Four years later, this 2021 research still powers everything from image generators to autonomous vehicles.

Liza Chan, AI & Emerging Tech Correspondent

December 17, 2025 · 5 min read
[Image: Abstract visualization of multimodal AI connecting images and text through neural network connections]

OpenAI researchers published a paper in February 2021 titled "Learning Transferable Visual Models From Natural Language Supervision" that introduced CLIP, a model trained on 400 million image-text pairs scraped from the internet. The paper, which later appeared at ICML 2021, has since become one of the most foundational pieces of research in modern AI, though its influence is often invisible to end users.

The bet that paid off

The core insight behind CLIP was almost embarrassingly simple. Instead of training vision models on carefully labeled datasets like ImageNet (which required over 25,000 workers to label 14 million images), why not just learn from the messy, naturally occurring captions that already exist on the web?

The model uses contrastive learning to predict which caption goes with which image, a training task that turned out to be surprisingly scalable. Where previous approaches required task-specific training data, CLIP could generalize to new visual concepts described in plain English. The paper reported matching ResNet-50's accuracy on ImageNet without using any of the 1.28 million training examples that ResNet was trained on.
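The contrastive objective can be sketched in a few lines. This is an illustrative NumPy implementation of a symmetric CLIP-style loss, not OpenAI's code; the embeddings, batch size, and temperature value are stand-ins:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss sketch: each image should match its
    own caption (the diagonal of the similarity matrix) against every
    other caption in the batch, and vice versa."""
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (batch, batch) similarities
    labels = np.arange(len(logits))          # correct pairs sit on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image->text and text->image directions, as in the paper
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```

Minimizing this loss pushes matching image-text pairs together in the embedding space and mismatched pairs apart, which is the entire training signal.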

That claim deserves some scrutiny. Zero-shot performance on ImageNet is one thing, but the benchmarks that matter tend to be the ones companies don't highlight. CLIP struggles with fine-grained classification tasks like differentiating car models or flower species, and its performance on handwritten digits is surprisingly weak: 88% on MNIST versus 99.75% for humans. Not exactly the stuff of AI supremacy.

Where CLIP actually lives now

You've probably used CLIP dozens of times without knowing it. DALL-E 2 builds on CLIP by conditioning its diffusion process on CLIP image embeddings rather than raw text. Stable Diffusion uses a frozen CLIP ViT-L/14 text encoder. Midjourney, though more secretive about its architecture, almost certainly incorporates CLIP-style components.

The model has become infrastructure, not product. Text inputs go to a text encoder, image inputs to an image encoder, and both models output vectors in the same embedding space. This shared representation is what allows you to type "astronaut riding a horse" and get back something plausible.
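The shared embedding space is also what makes zero-shot classification possible: embed a set of candidate captions, embed the image, and pick the nearest caption. A toy sketch with made-up three-dimensional vectors (a real system would get these from CLIP's encoders):

```python
import numpy as np

# Hypothetical pre-computed caption embeddings in the shared space
text_embeddings = {
    "a photo of a cat": np.array([0.9, 0.1, 0.0]),
    "a photo of a dog": np.array([0.1, 0.9, 0.0]),
    "a photo of a car": np.array([0.0, 0.1, 0.9]),
}

def zero_shot_classify(image_embedding, text_embeddings):
    """Return the caption whose embedding is most similar to the image's."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(text_embeddings,
               key=lambda cap: cosine(image_embedding, text_embeddings[cap]))

image_emb = np.array([0.85, 0.15, 0.05])  # pretend this came from the image encoder
print(zero_shot_classify(image_emb, text_embeddings))  # → "a photo of a cat"
```

No retraining is needed to add a new class: just embed a new caption.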

OpenAI described the training data only in broad strokes, 400 million image-text pairs collected from the internet, and never released the dataset itself. That opacity hasn't stopped an entire ecosystem from building on top of it. OpenCLIP recreated the training approach on open datasets. LAION assembled billions of image-text pairs to enable further research.

The limitations nobody talks about

The CLIP paper included a section on limitations that reads, in hindsight, like a warning label that got ignored. Zero-shot CLIP struggles on images not covered in its pre-training dataset, and its classifiers can be sensitive to wording, sometimes requiring trial-and-error prompt engineering.
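The paper's mitigation for wording sensitivity was prompt ensembling: embed several phrasings of each class name and average them. A sketch of the idea, where `embed_text` is a deterministic stand-in for CLIP's real text encoder (the templates are examples of the style the paper used, not an exact list):

```python
import numpy as np

templates = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a sketch of a {}.",
]

def embed_text(prompt, dim=8):
    # Stand-in encoder for illustration: deterministic pseudo-random vector
    seed = sum(prompt.encode())
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def class_embedding(label):
    """Average the embeddings of all templated prompts, then renormalize,
    so no single phrasing dominates the classifier."""
    vecs = [embed_text(t.format(label)) for t in templates]
    mean = np.mean(vecs, axis=0)
    return mean / np.linalg.norm(mean)
```

The averaged class embedding then replaces the single-caption embedding in zero-shot classification, smoothing over the model's sensitivity to any one phrasing.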

Empirical investigations reveal that CLIP often fails as a zero-shot classifier in manufacturing applications due to the domain gap between its training data and industrial use cases. The web contains lots of cats and cars, not so many semiconductor defects or welding seams.

There's also the bias problem. The model card on Hugging Face notes that CLIP poses issues with regard to fairness and bias, and that the specific biases it exhibits can depend significantly on class design. OpenAI tested classification of people into crime-related categories and found troubling patterns. This hasn't stopped CLIP from being deployed in countless applications with minimal auditing.

Google's answer

Google's SigLIP replaces CLIP's softmax loss with sigmoid, treating each image-text pair independently rather than requiring a global view of all pairs within a batch. This sounds like a minor technical change, but it has real consequences: more efficient training with smaller batch sizes and better performance at the margins.
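The difference from CLIP's loss is visible in code. An illustrative NumPy sketch of a SigLIP-style pairwise sigmoid loss; the temperature and bias values here are stand-ins for parameters that SigLIP actually learns during training:

```python
import numpy as np

def siglip_sigmoid_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Sketch of a SigLIP-style loss: every (image, text) pair becomes an
    independent binary classification, so no batch-wide softmax
    normalization is required."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T * temperature + bias
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * np.eye(len(logits)) - 1
    # -log sigmoid(label * logit), averaged over all pairs in the batch
    return np.mean(np.log1p(np.exp(-labels * logits)))
```

Because each pair contributes its own binary term, the loss no longer needs a global view of the batch, which is what makes smaller batches viable.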

SigLIP 2, released in early 2025, outperforms its predecessors at all model scales in zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for VLMs. The new version adds captioning-based pretraining, self-distillation, and masked prediction to the training recipe. It also addresses multilingual support and bias concerns more directly than the original CLIP.

Whether this represents CLIP's obsolescence or just its maturation is an open question. Most production systems still use CLIP embeddings. Switching has costs.

What the citations suggest

The CLIP paper has accumulated tens of thousands of citations across computer vision, natural language processing, and robotics. CLIP embeds both images and text into a shared vector space using contrastive learning, enabling tasks like image classification and image-text retrieval with strong zero-shot capabilities. That description now appears in virtually every paper on multimodal AI.

The authors themselves have mostly moved on. Alec Radford, the paper's first author, was previously known for GPT-2. The paper's co-author list reads like a roster of OpenAI's research bench at the time: Jong Wook Kim, Aditya Ramesh, Gabriel Goh, Ilya Sutskever.

Sutskever left OpenAI in early 2024. Ramesh led DALL-E development before departing in late 2024. The team that built the foundation has scattered, but the foundation remains.

The real test

Four years is ancient in AI research. Models that seemed impressive in 2021 now look quaint. Yet CLIP persists, not because it's the best at any particular task, but because it established a template that proved remarkably durable.

Contrastive loss functions are crucial in training vision-language models because they help the model learn to distinguish between correct and incorrect image-text pairs. That sentence would have required extensive explanation in 2020. Now it's background knowledge.

The question facing researchers isn't whether CLIP's approach works. It's whether the next generation of models can do better while maintaining compatibility with the ecosystem that's grown up around it. Google is betting on SigLIP. Meta has its own approaches. OpenAI has been characteristically quiet about what comes next.

The CLIP paper remains freely available on arXiv, and the code sits on GitHub with an MIT license. Whatever replaces it will likely cite it.

Tags: CLIP, OpenAI, multimodal AI, computer vision, machine learning, contrastive learning, zero-shot learning, Stable Diffusion, DALL-E
Liza Chan

AI & Emerging Tech Correspondent

Liza covers the rapidly evolving world of artificial intelligence, from breakthroughs in research labs to real-world applications reshaping industries. With a background in computer science and journalism, she translates complex technical developments into accessible insights for curious readers.

