
A beginner's guide to Vision Language Models (VLMs)

The amount of visual data we constantly ingest is massive, and our ability to function in an environment improves greatly when we have access to this modality. Being able to use it as context when interacting with Language Models (LMs) therefore doesn't seem like a far-fetched idea. What follows is an introduction to the mechanisms that make this integration possible.


On the importance of visual information

From the moment we’re born, our brains are inundated with visual information, processing vast amounts of data that shape our understanding of the world. According to calculations made by Yann LeCun, the chief AI scientist at Meta, at age four, a child has processed approximately $10^{15}$ bytes of visual data through their optic nerve, which is about 50 times more than the largest language models have been trained on.

This immense visual intake is crucial for cognitive development, enabling humans to interpret complex scenes, recognize faces, and navigate environments intuitively. In contrast, traditional language models, trained solely on text, lack this rich visual context, limiting their ability to fully comprehend and generate content related to the visual world.

Integrating visual capabilities into language models is not just about adding utility; it’s about enhancing their fundamental ability to process and understand the world more like humans do. Vision-language models (VLMs) bridge this gap by combining visual and textual data, allowing for more nuanced understanding and generation.

A picture is worth a thousand words

A prime example of this advancement is PaliGemma, an open vision-language model developed by Google. PaliGemma integrates SigLIP, a contrastively trained vision encoder, with the Gemma language model, enabling it to process both images and text. This integration allows PaliGemma to perform tasks such as image captioning, visual question answering, and object detection with remarkable proficiency.

By studying models like PaliGemma, we can explore how the fusion of visual and textual data enhances a model’s capabilities, bringing us closer to developing AI systems that perceive and interpret the world in ways akin to human cognition.
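
To make this concrete, here is a minimal sketch of what querying such a model can look like through the Hugging Face transformers library. The checkpoint id, prompt, and generation settings below are illustrative assumptions, not a prescription from this post.

```python
# Minimal sketch: captioning an image with PaliGemma via Hugging Face transformers.
# The checkpoint id and prompt are illustrative assumptions.
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("cat.jpg")   # any local image
prompt = "caption en"           # PaliGemma-style task prefix

inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)

# Decode only the newly generated tokens (the answer after the prompt).
generated = output_ids[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
```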

In the following sections, we’ll delve deeper into the mechanisms that enable this integration, examining how vision-language models are trained, their architectures, and the potential applications that arise from their enhanced understanding of visual and textual information.

The following image outlines the two main phases of training: first, contrastive pretraining, which yields a vision encoder whose embeddings a language model can interact with more easily; second, autoregressive training, which is the usual language-modeling objective.

[Figure: the two training phases, contrastive pretraining of the vision encoder followed by autoregressive training with the language model]

A primer on contrastive learning

Contrastive learning is foundational to the training of vision encoders in vision-language models like PaliGemma. Unlike traditional supervised learning, which relies on labeled data, contrastive learning learns representations by aligning similar data points while separating dissimilar ones in the latent space.

For vision-language tasks, this involves aligning image embeddings with their corresponding text embeddings, mapping both modalities into a shared embedding space. This allows the model to establish meaningful relationships between visual and textual data, such as associating the image of a cat with the caption “a cute cat sitting on a windowsill.”
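
A minimal sketch of what this shared space buys us, assuming we already have trained image and text encoders that output vectors of the same dimension (the encoders themselves are placeholders here, replaced by random tensors):

```python
# Sketch: scoring image-text pairs in a shared embedding space.
import torch
import torch.nn.functional as F

def similarity_matrix(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every image and every caption."""
    image_embeds = F.normalize(image_embeds, dim=-1)  # unit-norm rows
    text_embeds = F.normalize(text_embeds, dim=-1)
    return image_embeds @ text_embeds.T               # (num_images, num_texts)

# Toy example with random vectors standing in for encoder outputs.
images = torch.randn(4, 512)    # 4 images embedded in a 512-d space
captions = torch.randn(4, 512)  # 4 candidate captions in the same space
sims = similarity_matrix(images, captions)
print(sims.argmax(dim=-1))      # best-matching caption index for each image
```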

Contrastive Language-Image Pretraining (CLIP)

Contrastive Language-Image Pretraining, as exemplified in models like CLIP, integrates two key components:

  • A text encoder (usually a Transformer encoder such as BERT)
  • A vision encoder (usually a Vision Transformer, ViT)

The weights of the vision encoder can be initialized from a pretrained model. Depending on the objective, these weights might be frozen, requiring only the text encoder to adapt to the vision encoder's outputs. This approach, known as Locked-image Tuning (LiT), differs from the method used in SigLIP, where both the vision and text encoder weights are updated during training.
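
A minimal sketch of the difference, in PyTorch-style pseudocode: under a LiT-style setup the pretrained vision tower is frozen and only the text tower receives gradients, whereas SigLIP-style training leaves both trainable (module names and optimizer settings are placeholders):

```python
# Sketch: Locked-image Tuning (LiT) vs. SigLIP-style training.
# `vision_encoder` and `text_encoder` stand for any pretrained torch.nn.Module.
import torch

def configure_lit(vision_encoder: torch.nn.Module, text_encoder: torch.nn.Module):
    """LiT: lock the vision tower, adapt only the text tower."""
    for p in vision_encoder.parameters():
        p.requires_grad = False  # vision weights stay fixed
    return torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)

def configure_siglip(vision_encoder: torch.nn.Module, text_encoder: torch.nn.Module):
    """SigLIP: both towers are updated during contrastive training."""
    trainable = list(vision_encoder.parameters()) + list(text_encoder.parameters())
    return torch.optim.AdamW(trainable, lr=1e-4)
```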

The SigLIP Training Objective

In SigLIP, the vision encoder’s weights are updated alongside those of the text encoder to achieve optimal alignment between modalities. The model is trained to maximize similarity scores for matched image-text pairs while minimizing them for mismatched pairs.

The loss function guiding this process is defined as:

\[\begin{equation*} \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \log\left(\sigma\left( (2\delta_{ij} - 1) \cdot \text{logits}_{ij} \right)\right) \end{equation*}\]

Where:

  • $N$: the number of text and image embeddings (assuming an equal number of both)
  • \(\text{logits}_{ij}\): the logit score for the similarity between the i-th text embedding and the j-th image embedding, calculated as \(\text{logits}_{ij} = \exp(\text{logit\_scale}) \cdot (\mathbf{t}_i \cdot \mathbf{v}_j) + \text{logit\_bias}\), where:
    • \(\mathbf{t}_i\): the i-th text embedding vector
    • \(\mathbf{v}_j\): the j-th image embedding vector
    • \(\text{logit\_scale}\) and \(\text{logit\_bias}\) are learnable scalars, introduced to address the imbalance between positive and negative pairs at the start of training (at initialization, each batch contains far more negative pairs than positive ones)
  • \(\delta_{ij}\): The Kronecker delta function, which equals 1 if i = j (for matching pairs) and 0 otherwise.
  • \(\sigma(x)\): Sigmoid function: \(\sigma(x) = \frac{1}{1 + \exp(-x)}\)
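
Putting these pieces together, a minimal sketch of the loss in PyTorch could look as follows; this is a simplified re-implementation for illustration, not the official SigLIP code, and the initialization values in the toy usage are assumptions.

```python
# Sketch: pairwise sigmoid loss over a batch of N image-text pairs.
import torch
import torch.nn.functional as F

def siglip_loss(text_embeds, image_embeds, logit_scale, logit_bias):
    """text_embeds, image_embeds: (N, D) L2-normalized embeddings.
    logit_scale, logit_bias: learnable scalars (scale stored in log-space)."""
    n = text_embeds.shape[0]
    # Pairwise logits: every text against every image.
    logits = logit_scale.exp() * text_embeds @ image_embeds.T + logit_bias
    # +1 on the diagonal (matching pairs), -1 everywhere else: (2*delta_ij - 1).
    labels = 2 * torch.eye(n, device=logits.device) - 1
    # -log(sigmoid(labels * logits)), summed over all pairs and averaged per example.
    return -F.logsigmoid(labels * logits).sum() / n

# Toy usage with random, normalized embeddings standing in for encoder outputs.
t = F.normalize(torch.randn(8, 512), dim=-1)
v = F.normalize(torch.randn(8, 512), dim=-1)
scale = torch.tensor(2.3)   # roughly log(10), an assumed initialization
bias = torch.tensor(-10.0)  # negative bias counteracts the many negative pairs
print(siglip_loss(t, v, scale, bias))
```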

Visually, this involves maximizing the similarity of the elements on the diagonal (those that match) and minimizing that of the off-diagonal elements (those that don't match).

[Figure: image-text similarity matrix, with matching pairs on the diagonal and mismatched pairs off the diagonal]

With a robust vision encoder now capable of generating embeddings that capture the essence of visual inputs, the next step is integrating these embeddings with a language model. This integration transforms standalone image features into meaningful text, enabling models like PaliGemma to perform tasks such as image captioning, visual question answering, and more.

The Role of the Vision Encoder in Fusion

The embeddings produced by the contrastively trained vision encoder are not used in isolation. They serve as inputs to the language model, acting as a bridge between visual data and text. In PaliGemma, this involves a projection layer: a learned transformation that maps the vision encoder's outputs into a space compatible with the language model's input format (its hidden dimension), since the goal is to concatenate these outputs with the text embeddings.
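
A minimal sketch of that fusion step, assuming a vision-encoder output dimension `vision_dim` and a language-model hidden size `lm_dim`; the class name, dimensions, and token counts are illustrative placeholders, not PaliGemma's exact implementation.

```python
# Sketch: projecting vision-encoder outputs into the language model's embedding
# space and concatenating them with the text embeddings.
import torch
import torch.nn as nn

class MultiModalProjector(nn.Module):
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        # A single learned linear map from vision space to LM hidden space.
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_image_tokens, vision_dim)
        return self.proj(image_features)

# Toy fusion: image tokens come first, text tokens follow.
projector = MultiModalProjector(vision_dim=768, lm_dim=2048)
image_features = torch.randn(1, 256, 768)  # e.g. 256 patch embeddings from the vision encoder
text_embeds = torch.randn(1, 16, 2048)     # embedded prompt tokens
image_embeds = projector(image_features)
fused = torch.cat([image_embeds, text_embeds], dim=1)  # (1, 256 + 16, 2048)
```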

[Figure: vision-encoder outputs projected and concatenated with text embeddings before being fed to the language model]

From Joint Pretraining to Task-Specific Fine-Tuning

The training of vision-language models like PaliGemma involves two key phases. In joint pretraining, the model processes images and text simultaneously on tasks like caption generation and visual question answering, ensuring tight alignment between the vision encoder and the language model. After pretraining, task-specific fine-tuning adapts the model to specialized applications, such as object detection or more nuanced text generation.

An interesting design choice in PaliGemma is the use of full attention between the visual embeddings and the text prompt (prefix), rather than strictly causal attention.

[Figure: PaliGemma attention pattern, with full attention over image tokens and the text prefix, and causal attention over the suffix]
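
A minimal sketch of how such a mask can be built: image tokens and the prefix attend to each other bidirectionally, while suffix tokens are restricted to causal attention. The function name and sequence sizes are illustrative.

```python
# Sketch: PaliGemma-style prefix-LM attention mask.
# Image tokens + prefix tokens attend to each other fully (bidirectional);
# suffix tokens attend causally within the suffix, and to the whole prefix.
import torch

def prefix_lm_mask(num_prefix: int, num_suffix: int) -> torch.Tensor:
    """Returns a boolean mask of shape (L, L) where True means 'may attend'."""
    total = num_prefix + num_suffix
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Block 1: every token can see the whole prefix (image + prompt).
    mask[:, :num_prefix] = True
    # Block 2: suffix tokens see their own and earlier suffix positions only.
    suffix_causal = torch.tril(torch.ones(num_suffix, num_suffix, dtype=torch.bool))
    mask[num_prefix:, num_prefix:] = suffix_causal
    return mask

print(prefix_lm_mask(num_prefix=4, num_suffix=3).int())
```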

Although one of the authors has offered an interpretation, I prefer to stick to the empirical evidence shown in the figure below, where, on average, results on downstream tasks are better when the causal mask is applied only to the suffix.

[Figure: average downstream-task results comparing attention-mask choices]

This integration is driven by an autoregressive loss function common in language modeling, defined as:

\[\mathcal{L}_{\text{LM}} = -\sum_{i} \log P(t_i \mid t_{<i}, v)\]

Here, \(t_i\) represents the \(i\)-th text token, and \(v\) is the vision embedding from the encoder. This loss ensures that the model’s generated text remains consistent with both the prior text tokens and the visual context, effectively unifying the modalities.
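
In code, this corresponds to a standard next-token cross-entropy restricted to the suffix positions, with the image embeddings and prefix serving purely as conditioning context. A minimal sketch, assuming the fused sequence is laid out as image tokens, then prefix, then suffix (the function name and masking convention are placeholders):

```python
# Sketch: autoregressive loss over the suffix, conditioned on image + prefix tokens.
import torch
import torch.nn.functional as F

def suffix_lm_loss(logits: torch.Tensor, target_ids: torch.Tensor, num_prefix: int) -> torch.Tensor:
    """logits: (batch, seq_len, vocab_size) over the fused image+prefix+suffix sequence.
    target_ids: (batch, seq_len) token ids (image/prefix positions hold placeholder ids).
    num_prefix: number of image + prefix tokens; only suffix positions contribute."""
    # Shift so that position i predicts token i+1.
    shift_logits = logits[:, :-1, :]
    shift_targets = target_ids[:, 1:].clone()
    # Ignore every prediction whose target is still part of the image/prefix.
    shift_targets[:, : num_prefix - 1] = -100  # -100 is ignored by cross_entropy
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
        ignore_index=-100,
    )
```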

Further Exploration

While this overview highlights key concepts behind contrastive learning and its implementation in models like PaliGemma, many intricate details have been omitted for brevity. For those interested in diving deeper, you can explore the raw implementation details on my GitHub repository or check out Umar Jamil’s excellent YouTube video for a more hands-on explanation.

These resources offer a more granular view of the mechanisms and practical challenges, providing a comprehensive understanding for those eager to implement or experiment with these techniques.

This post is licensed under CC BY 4.0 by the author.