PaliGemma

A treasure trove of findings in the multimodal text and image space

This paper lays out in quite some detail what works when it comes to training the latest generation of multimodal text and image models.

Looking back over the last five years, many have tried different ways to combine the two modalities so that a single model can deal in both language and images, either by learning a shared representation space or by marrying specialist models to bridge the gap. The authors sum up the progression:

| Generation | Key Models | Description | Key Features |
| --- | --- | --- | --- |
| First generation | CLIP, ALIGN, ConVIRT, VirTex | Extension of large-scale classification pretraining to leverage web data without human labeling, using caption embeddings from language encoders. | Replaces fixed class sets with caption embeddings; uses language encoders similar to BERT. |
| Second generation | Generative encoder-decoder models (akin to T5) | Unification of captioning and question-answering tasks via generative encoder-decoder modeling, leveraging progress in generative language models. | Combines captioning and QA tasks, often backed by generative language model advancements. |
| Scaled-up models | Flamingo, BLIP-2, PaLI | Further scaling of second-generation models. | Enhanced capabilities and performance through scaling up. |
| Recent advances | Gemini, GPT-4, Moondream | Introduction of “instruction tuning” to make raw models more user-friendly, along with systematic studies to identify important factors in VLMs. | Instruction tuning for improved usability; systematic studies on VLM effectiveness. |

The Google Journey

The Google team go through the progress they have made in the space, with explicit reference to the sizes and architectures used along the way:

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'fontFamily': 'monospace'}}}%%
graph TD
    A[PaLI<br>17B parameters<br>Image: ViT-e 4B<br>Text: mT5-XXL 13B] --> B[PaLI-X<br>54B parameters<br>Image: ViT-22B<br>Text: 32B UL2]
    A --> C[PaLM-E<br>540B+ parameters<br>Image: ViT-22B<br>Text: 540B PaLM]
    B --> D[PaLI-3<br>5B parameters<br>Image: 2B ViT-G/14<br>Text: 3B UL2]
    C --> D
    D --> E[PaliGemma<br>&lt; 3B parameters<br>Image: 400M SigLIP<br>Text: 2B Gemma]
    style A fill:#69e375,stroke:#333,stroke-width:2px
    style B fill:#fd9b32,stroke:#333,stroke-width:2px
    style C fill:#fd9b32,stroke:#333,stroke-width:2px
    style D fill:#e56be5,stroke:#333,stroke-width:2px
    style E fill:#5dedeb,stroke:#333,stroke-width:2px
```

PaliGemma

PaliGemma is the latest model on Google’s path of multimodal advancements. It has fewer than 3 billion parameters, combining a SigLIP image encoder (400M) with a Gemma text model (2B), and it stands out for its efficiency and performance across a wide range of tasks.
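
At a high level, the wiring is simple: the SigLIP encoder turns the image into a sequence of tokens, a linear layer projects them into Gemma's embedding space, and the resulting sequence is prepended to the text tokens before being fed to the decoder. The sketch below illustrates that wiring; the module interfaces and the dimensions (1152 for a SigLIP So400m tower, 2048 for Gemma 2B) are assumptions for illustration, not the paper's code.

```python
# Toy sketch of the PaliGemma-style wiring: vision tokens -> linear connector
# -> prepend to text embeddings -> decoder. Interfaces are assumptions.
import torch
import torch.nn as nn


class ToyVisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1152, text_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g. a SigLIP ViT (~400M params)
        self.language_model = language_model    # e.g. a Gemma 2B decoder
        # The paper's ablations favour a simple linear connector over an MLP.
        self.connector = nn.Linear(vision_dim, text_dim)

    def forward(self, pixel_values, text_embeddings):
        # At 224px with 14px patches: (224 / 14)^2 = 256 image tokens.
        image_tokens = self.vision_encoder(pixel_values)      # [B, 256, vision_dim]
        image_tokens = self.connector(image_tokens)           # [B, 256, text_dim]
        # Image tokens come first, then the text tokens.
        sequence = torch.cat([image_tokens, text_embeddings], dim=1)
        return self.language_model(inputs_embeds=sequence)
```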

The models are trained to be versatile and broadly knowledgeable base models that transfer effectively; the base models must then be transferred (fine-tuned) to serve their intended final purpose. The tasks the models are intended to transfer well to are: image classification, captioning, visual QA, and dialogue. There is scope for further tasks whose output is not necessarily text (think detection, instance segmentation, panoptic segmentation, depth prediction, colorization...), but these are not developed here.

Training In Stages

The training of PaliGemma follows the same steps as previous PaLI models, with only small modifications. Training consists of several stages.

The separation of training into these phases is key:

Stage 0

Takes off-the-shelf, single-modality models:

- SigLIP for the image encoder
- Gemma 2B for the language model
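
For a concrete sense of what “off the shelf” means, the closest public checkpoints can be loaded from the Hugging Face Hub. The checkpoint names below are my best guesses at the released equivalents, not something specified in the paper, and the Gemma weights are gated behind a licence agreement.

```python
# Loading the two single-modality components from public checkpoints
# (checkpoint names are assumptions; Gemma access requires accepting its licence).
from transformers import AutoModelForCausalLM, SiglipVisionModel

vision = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
language = AutoModelForCausalLM.from_pretrained("google/gemma-2b")


def n_params(model) -> float:
    """Total parameter count, in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6


print(f"SigLIP vision tower: ~{n_params(vision):.0f}M parameters")    # roughly 400M
print(f"Gemma language model: ~{n_params(language):.0f}M parameters")  # roughly 2B
```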


Stage 1

Combines unimodal models and trains them on diverse vision-language tasks, aiming to create a versatile base model rather than just aligning modalities. Unlike common practice, PaliGemma doesn’t freeze the image encoder during this stage, allowing it to learn spatial and relational understanding. To mitigate potential degradation, a slow linear warm-up is used for the image encoder’s learning rate. The model is trained at 224px resolution with 256 image tokens and 128 text tokens for 1 billion examples, ensuring a broad coverage of visual knowledge, concepts, cultures, and languages.
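
The “slow linear warm-up” for the image encoder is worth spelling out: the encoder is trainable from the start, but its learning rate ramps up gradually so that early, noisy gradients flowing back from the freshly connected language model do not wreck the pretrained visual features. A minimal sketch is below; the step counts and peak rate are illustrative assumptions, not the paper's schedule.

```python
# Minimal linear warm-up schedule. Values are illustrative, not the paper's.
def linear_warmup(step: int, warmup_steps: int, peak_lr: float) -> float:
    """Ramp the learning rate linearly from 0 to peak_lr over warmup_steps."""
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


# A longer (slower) warm-up for the vision tower than for the rest of the model.
for step in (0, 1_000, 10_000, 50_000):
    vision_lr = linear_warmup(step, warmup_steps=50_000, peak_lr=1e-5)
    rest_lr = linear_warmup(step, warmup_steps=2_000, peak_lr=1e-5)
    print(f"step {step:>6}: vision lr {vision_lr:.2e}, rest lr {rest_lr:.2e}")
```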

Stage 2

Focuses on increasing image resolution to enhance performance on tasks requiring finer visual detail. Two additional checkpoints are trained, at 448x448 and 896x896 pixel resolution. This stage uses fewer examples but increases information density, with 50M examples for 448px and 10M for 896px. It maintains the same task mixture as Stage 1 but emphasizes high-resolution tasks and extends the text sequence length to 512 tokens. This allows the model to handle more complex visual tasks such as detailed object detection, segmentation, and reading text in images, addressing the growing recognition of the importance of resolution in vision-language models.
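
The cost of those higher resolutions is easy to see with some back-of-the-envelope arithmetic: with a 14-pixel patch size, the number of image tokens grows quadratically with the side length (the patch size is an assumption carried over from the 256-tokens-at-224px figure in Stage 1).

```python
# Image-token count vs. resolution for a 14px patch size (assumed).
PATCH = 14
for resolution in (224, 448, 896):
    tokens = (resolution // PATCH) ** 2
    print(f"{resolution}px -> {tokens} image tokens")
# 224px -> 256 image tokens
# 448px -> 1024 image tokens
# 896px -> 4096 image tokens
```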

Stage 3

Focuses on transfer learning, adapting the pre-trained model (available in 224px, 448px, and 896px resolutions) to specific tasks or use cases. This stage involves fine-tuning the model for various applications, from specialized tasks like COCO Captions or Video Captioning to more general instruction or chat tuning. The transfer process uses a unified recipe with adjustable hyper-parameters, including resolution, epochs, learning rate, and dropout. PaliGemma demonstrates versatility by adapting to tasks involving multiple images or video frames, encoding them separately and concatenating the tokens. This stage showcases the model’s effectiveness across academic benchmarks and its potential for broader applications beyond standard tasks.
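
For readers who want to poke at the pretrained checkpoints before doing their own Stage-3-style transfer, the sketch below uses the Hugging Face integration of the public release. The model IDs and the task-prefix prompt format come from the public model cards rather than from this write-up, so treat them as assumptions.

```python
# Prompting a pretrained PaliGemma checkpoint with a task prefix
# (model ID and prompt format taken from the public release; adjust as needed).
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"   # -448 and -896 variants also exist
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")          # any local image
inputs = processor(text="caption en", images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```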

Fine Tuning Stage 3 Recipe

In decreasing order of importance:

| Parameter | Values |
| --- | --- |
| Resolution | 224, 448, 896 |
| Epochs | 1, 3, 10, 30, 100 |
| Learning-rate | 3e-5, 1e-5, 3e-6 |
| Label-smoothing | 0.0, 0.1, 0.3 |
| Dropout in the LLM | 0.0, 0.1, 0.3 |
| Weight decay | 0.0 or 0.1 × learning-rate |
| Freeze ViT | false, true |
| Beam-search | May benefit captioning |

The paper marks a recommended initial value to try for each parameter.
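
Expressed as code, the recipe is essentially a small hyper-parameter search space to be swept in the order listed. The dictionary below is just a convenient restatement of the table: the structure and key names are mine, the candidate values are the paper's.

```python
# Stage 3 transfer recipe as a search space, ordered by decreasing importance.
TRANSFER_SEARCH_SPACE = {
    "resolution":      [224, 448, 896],
    "epochs":          [1, 3, 10, 30, 100],
    "learning_rate":   [3e-5, 1e-5, 3e-6],
    "label_smoothing": [0.0, 0.1, 0.3],
    "llm_dropout":     [0.0, 0.1, 0.3],
    "weight_decay":    [0.0, 0.1],      # 0.1 is applied as 0.1 x learning-rate
    "freeze_vit":      [False, True],
    "beam_search":     [False, True],   # may benefit captioning
}

# A practical sweep: tune one parameter at a time, front to back, keeping the
# best value found so far for the earlier (more important) parameters.
for name, candidates in TRANSFER_SEARCH_SPACE.items():
    print(f"{name}: try {candidates}")
```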

Ablations

The huge value of this paper is the density of information provided in the ablations; this is where the treasure lies.

  1. A prefix-LM setup (full attention over the image and task prefix, with supervision applied only to the suffix tokens) is effective for VLM pretraining; see the mask sketch after this list.
  2. New token initialization with small Gaussian noise performs better than matching pretrained embeddings' average.
  3. Freezing different parts during pretraining impacts performance differently:
    • Not freezing any part yields the best results.
    • Freezing the language model significantly hurts performance.
  4. Linear connectors outperform MLP connectors slightly.
  5. Using a SigLIP image encoder is more sample-efficient than raw image patches.
  6. Higher resolution generally improves performance due to increased information content and model capacity.
  7. Separate checkpoints for different resolutions are recommended over user-specified resolutions alone.
  8. Stage-specific mixture re-weighting helps slightly but isn't crucial for base models intended for fine-tuning.
  9. Transfer is data-efficient and stable: most tasks achieve near full-data scores even with limited examples (e.g., within 10% using only 4k examples).
  10. Annotations in images work as well as textual prompts for indicating specific elements to be captioned or analyzed.
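
To make point 1 concrete, here is a small sketch of the attention and loss masks implied by that prefix-LM objective: the image and prefix tokens see each other bidirectionally, the suffix is autoregressive, and only the suffix contributes to the loss. The helper names and the tiny example sizes are mine.

```python
# Prefix-LM masks: full attention within the (image + task-prefix) block,
# causal attention over the suffix, loss only on the suffix tokens.
import numpy as np


def prefix_lm_attention_mask(num_prefix: int, num_suffix: int) -> np.ndarray:
    """Boolean mask, True where attention is allowed (rows = queries, cols = keys)."""
    n = num_prefix + num_suffix
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :num_prefix] = True                   # every token sees the full prefix
    mask[num_prefix:, num_prefix:] = np.tril(     # suffix attends autoregressively
        np.ones((num_suffix, num_suffix), dtype=bool))
    return mask


def loss_mask(num_prefix: int, num_suffix: int) -> np.ndarray:
    """Loss is computed only on the suffix (target) tokens."""
    return np.concatenate([np.zeros(num_prefix, dtype=bool),
                           np.ones(num_suffix, dtype=bool)])


# Tiny example: 3 prefix tokens (image + prompt) and 2 suffix (answer) tokens.
print(prefix_lm_attention_mask(3, 2).astype(int))
print(loss_mask(3, 2).astype(int))
```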