Self-Supervised and Contrastive Learning: BERT, CLIP, and the Unlabeled Data Revolution in Representation Learning

Early deep learning depended heavily on large human-annotated datasets such as ImageNet, with more than a million labeled images, and annotation cost was a major barrier to deploying AI. Self-supervised learning (SSL) instead constructs its supervision signal from the data itself rather than from human labels, so models can learn high-quality feature representations from massive amounts of unlabeled data. It has become one of the foundational paradigms of modern AI.

BERT: Masked Language Modeling Pretraining

BERT (Bidirectional Encoder Representations from Transformers, Google, 2018) is a foundational work in self-supervised learning for NLP. Its core pretraining task is Masked Language Modeling (MLM): randomly mask 15% of the input tokens and have the model predict the originals from the surrounding context. Applied to unlabeled text (Wikipedia and BooksCorpus, about 3.3 billion words), this simple task produces rich language representations that need only a small amount of labeled fine-tuning data to reach what were then state-of-the-art results on multiple downstream tasks.
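The masking step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not BERT's actual preprocessing: the function name `mask_tokens`, the whitespace tokenization, and the rule of always masking at least one token are assumptions made for the example. The BERT paper additionally replaces some selected tokens with random tokens or leaves them unchanged (an 80/10/10 split), which is omitted here.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol, following BERT's naming convention


def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Return (masked_tokens, labels) for one masked-language-modeling example.

    labels[i] holds the original token at masked positions and None elsewhere,
    so a training loss would only be computed where a token was hidden.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Assumption for this sketch: mask at least one token so that short
    # inputs still yield a training signal.
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))

    masked, labels = [], []
    for i, tok in enumerate(tokens):
        if i in positions:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(tok)         # keep the original as the target
        else:
            masked.append(tok)
            labels.append(None)        # no prediction needed here
    return masked, labels


# Hypothetical usage with naive whitespace tokenization.
sentence = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
masked, labels = mask_tokens(sentence.split(), seed=0)
print(masked)
print(labels)
```

The labels come directly from the input text, which is why no human annotation is needed: any sentence becomes a training example once some of its tokens are hidden.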

BERT’s success validated the “Pre-train + Fine-tune” paradigm and helped drive the rapid development of later GPT models, T5, and other large pretrained language models.

CLIP: Image-Text Contrastive Learning

CLIP (Contrastive Language-Image Pre-training, OpenAI, 2021) maps images and text into a shared semantic space through contrastive learning. It is trained on (image, text description) pairs, maximizing the similarity between the embeddings of matched pairs while minimizing the similarity between unmatched pairs. The training data consists of 400 million image-text pairs collected from the internet, with the accompanying text serving as supervision in place of manual labels.
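The objective described above can be sketched as a symmetric cross-entropy over a similarity matrix. The following is a minimal pure-Python illustration, not OpenAI's implementation: the function names are hypothetical, embeddings are assumed to be non-zero lists of floats, and the temperature is fixed at 0.07 for simplicity, whereas CLIP learns it during training.

```python
import math


def l2_normalize(v):
    # Assumes v is non-zero; scales it to unit length so that
    # dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    # Numerically stable -log(softmax(logits)[target]).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image_embs[i] is paired with text_embs[i]."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
           for img in images]
    # Image-to-text: for image i, the matching text i is the correct "class".
    i2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-image: the same, reading the matrix column by column.
    t2i = sum(cross_entropy_row([sim[i][j] for i in range(n)], j)
              for j in range(n)) / n
    return (i2t + t2i) / 2


# Toy embeddings: each image vector points roughly the same way as its text.
images = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
texts = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
aligned = clip_style_loss(images, texts)
shuffled = clip_style_loss(images, texts[1:] + texts[:1])  # pairs mismatched
print(aligned, shuffled)
```

With correctly paired inputs the loss is lower than with shuffled pairs, which is the signal that pulls matched image and text embeddings together during training.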

CLIP’s image-text alignment has made its encoders a common building block of text-to-image systems. Stable Diffusion v1, for example, uses a CLIP text encoder to convert prompts into guidance signals for image generation, and DALL-E 2 is built directly on CLIP embeddings.

Visual Contrastive Learning: SimCLR and MoCo

Visual contrastive learning constructs “positive pairs” (different augmented versions of the same image) and “negative samples” (different images), and trains the model to produce features that are invariant to the augmentations. SimCLR (Google, 2020) and MoCo (Facebook AI Research, now Meta, 2020) are the two most representative frameworks, and both approach the performance of supervised training under ImageNet linear evaluation.
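The positive-pair and negative-sample setup can be made concrete with a small sketch of the NT-Xent loss that SimCLR uses. This is an illustrative pure-Python version under stated assumptions: the embeddings are taken as given (no encoder or augmentation pipeline is shown), the function names are hypothetical, and the default temperature of 0.5 is an arbitrary choice, since SimCLR treats it as a tunable hyperparameter.

```python
import math


def cosine(u, v):
    # Cosine similarity; assumes both vectors are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent_loss(views_a, views_b, temperature=0.5):
    """views_a[k] and views_b[k] embed two augmentations of the same image k."""
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        # All other embeddings in the batch are candidates: one positive,
        # the rest negatives. Self-similarity is excluded.
        logits = {j: cosine(z[i], z[j]) / temperature
                  for j in range(2 * n) if j != i}
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        total += log_denom - logits[pos]  # -log softmax at the positive
    return total / (2 * n)


# Toy example: two images, two views each.
views_a = [[1.0, 0.0], [0.0, 1.0]]
consistent = nt_xent_loss(views_a, [[0.9, 0.1], [0.1, 0.9]])    # views agree
inconsistent = nt_xent_loss(views_a, [[0.1, 0.9], [0.9, 0.1]])  # views swapped
print(consistent, inconsistent)
```

The loss is lower when the two views of each image land close together and far from the other images, which is what pushes the encoder toward augmentation-invariant features.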

DINOv2 (Meta, 2023) pushed self-supervised visual representation learning further, matching or surpassing supervised-pretrained ViT models on multiple downstream tasks (image classification, object detection, semantic segmentation). It is among the strongest purely self-supervised visual feature extractors and is now applied in areas such as robot vision and medical imaging.
