Modality Extension Deep Dive

Goal: add a new modality (e.g. vision) to a base MoE architecture

Status Update:

  • 2026-02-11: Read through the LLaVA → MoE-LLaVA lineage; those two papers look like good starting points. Still need to look into compatibility with LoRA and figure out what Q-Former is.

Focus Papers:

  1. Q-Former
  2. FlexOlmo

LLaVA (Visual Instruction Tuning): Language Model + Vision Encoder + Projection Layer

  1. Pre-training for feature alignment: freeze the vision encoder and the language model, train only the projection layer
  2. Fine-tuning: keep the vision encoder frozen, fine-tune the language model and the projection layer (see the sketch below)
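
A minimal PyTorch sketch of this two-stage setup (dimensions, module names, and the freezing helper are my assumptions, not LLaVA's exact code):

```python
# Sketch of the LLaVA-style setup: frozen vision encoder, a small projection
# layer mapping vision features into the LLM embedding space, and stage-dependent
# freezing. Dimensions and names are illustrative.
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # LLaVA-1.0 used a single linear layer; later versions use an MLP

    def forward(self, vision_features):
        # vision_features: (batch, num_patches, vision_dim) from the frozen encoder
        return self.proj(vision_features)

def set_stage(vision_encoder, llm, projector, stage):
    # Stage 1 (alignment): only the projector is trained.
    # Stage 2 (fine-tuning): projector + LLM are trained; vision encoder stays frozen.
    for p in vision_encoder.parameters():
        p.requires_grad = False
    for p in projector.parameters():
        p.requires_grad = True
    for p in llm.parameters():
        p.requires_grad = (stage == 2)
```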

Code: https://github.com/haotian-liu/LLaVA

Paper: https://www.alphaxiv.org/abs/2304.08485

Issues: catastrophic forgetting, large fine-tuning cost

MoExtend: Tuning New Experts for Modality and Task Extension

Note: MoExtend is a successor to LLaVA

  1. Alignment stage: add a trainable MLP on top of the vision encoder and tune it on image-caption pairs for modality alignment
  2. Extension stage: determine which MoE layers need extension using an Extender
  3. Fine-tuning stage: fine-tune the added extensions on an instruction dataset while keeping all other parameters frozen

Extension Stage: Extender

  • Randomly sample 10k instruction examples related to the new modality as a validation set; the remaining data forms a sub-training set. The sub-training set is used to train the extender, and the new and old models are then evaluated on the validation set to analyze which layers are most sensitive to the new modality. The new expert is copied from the expert in the most sensitive layer, and the router is extended (see the sketch below).
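
A hedged sketch of how the expert copy and router extension could look (module structure and names are assumptions; the actual mechanics should be checked in the MoExtend repo):

```python
# Duplicate the expert that reacted most to the new modality and add a matching
# output row to the router. Mirrors the description above, not MoExtend's code.
import copy
import torch
import torch.nn as nn

def extend_moe_layer(experts: nn.ModuleList, router: nn.Linear, sensitive_idx: int):
    # 1) New expert is a copy of the most modality-sensitive old expert.
    new_expert = copy.deepcopy(experts[sensitive_idx])
    experts.append(new_expert)

    # 2) Extend the router: one extra output logit for the new expert.
    old_out, d_model = router.weight.shape
    new_router = nn.Linear(d_model, old_out + 1, bias=router.bias is not None)
    with torch.no_grad():
        new_router.weight[:old_out].copy_(router.weight)
        # Initialize the new row from the copied expert's row so routing starts similar.
        new_router.weight[old_out].copy_(router.weight[sensitive_idx])
        if router.bias is not None:
            new_router.bias[:old_out].copy_(router.bias)
            new_router.bias[old_out].copy_(router.bias[sensitive_idx])
    return experts, new_router
```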

MoExtend-Calibration Calibration Module:

  • Problem: adding new experts intrinsically reduces the routing weight assigned to the old experts, which causes implicit forgetting even though the old experts' weights themselves are unchanged.
  • Solution: add a calibration module after the router that rescales the old experts' gate weights to prevent forgetting. Need to check the exact methodology in the code; a guess is sketched below.
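
One plausible reading of the calibration module, pending confirmation against the MoExtend code (the learnable per-expert log-scale is my assumption):

```python
# A small learnable module after the router that rescales the gate logits of the
# old experts, so the softmax mass they lose to the new expert can be recovered
# without touching the old experts themselves.
import torch
import torch.nn as nn

class RouterCalibration(nn.Module):
    def __init__(self, num_old_experts: int):
        super().__init__()
        # One learnable log-scale per old expert, initialized to 0 (i.e., scale 1).
        self.log_scale = nn.Parameter(torch.zeros(num_old_experts))

    def forward(self, router_logits: torch.Tensor) -> torch.Tensor:
        # router_logits: (..., num_old_experts + num_new_experts)
        n_old = self.log_scale.shape[0]
        calibrated = router_logits.clone()
        calibrated[..., :n_old] = calibrated[..., :n_old] + self.log_scale
        return torch.softmax(calibrated, dim=-1)
```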

Fine-Tuning Stage

  • Train the calibration modules, the new expert, and the new router column (see the sketch below)
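
A minimal sketch of the selective freezing, assuming the relevant parameters can be identified by name (the keywords and the name-matching approach are assumptions):

```python
# Freeze everything, then re-enable gradients only for the calibration modules,
# the new experts, and the router. Restricting updates to just the *new* router
# column would additionally need a gradient mask/hook on the router weight.
def mark_trainable(model, trainable_keywords=("calibration", "new_expert", "router")):
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
    return [n for n, p in model.named_parameters() if p.requires_grad]
```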

Code: https://github.com/zhongshsh/MoExtend

Paper: https://www.alphaxiv.org/abs/2408.03511

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Note: MoE-LLaVA is a successor to MoExtend

  1. Train the MLP for feature alignment
  2. Train all parameters aside from the Vision Encoder (VE). Adapt the LLM into an LVLM with multi-modal capabilities via more complex instructions, including tasks such as image logical reasoning and text recognition.
  3. Replicate the FFN as the initialization weights for the new experts and train only the MoE layers. Each token is processed by the top-k experts selected by the router (see the sketch below).
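
A simplified sketch of the stage-3 idea (dense loop over experts for clarity; module names and sizes are assumptions, not MoE-LLaVA's actual implementation):

```python
# Replicate the dense FFN weights into E experts and route each token to its
# top-k experts. Real implementations use sparse dispatch instead of a loop.
import copy
import torch
import torch.nn as nn

class MoEFromFFN(nn.Module):
    def __init__(self, ffn: nn.Module, d_model: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Each expert starts as an exact copy of the pretrained FFN.
        self.experts = nn.ModuleList([copy.deepcopy(ffn) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        gates = torch.softmax(self.router(x), dim=-1)          # (tokens, E)
        topk_vals, topk_idx = gates.topk(self.top_k, dim=-1)   # (tokens, k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e).any(dim=-1)                 # tokens routed to expert e
            if mask.any():
                weight = topk_vals[mask][topk_idx[mask] == e].unsqueeze(-1)
                out[mask] += weight * expert(x[mask])
        return out
```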

Losses: Load Balancing Loss & Cross-Entropy (autoregressive)
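
For reference, the standard Switch-Transformer-style load-balancing auxiliary loss, which appears to be the form used here (the exact formulation and coefficient should be verified in the repo):

```python
# Auxiliary load-balancing loss: E * sum_i(f_i * P_i), added to the
# autoregressive cross-entropy with a small coefficient.
import torch

def load_balancing_loss(router_probs: torch.Tensor, expert_mask: torch.Tensor) -> torch.Tensor:
    # router_probs: (tokens, E) softmax gate probabilities
    # expert_mask:  (tokens, E) one-hot of the expert each token was dispatched to
    num_experts = router_probs.shape[-1]
    fraction_tokens = expert_mask.float().mean(dim=0)   # f_i: fraction of tokens per expert
    mean_probs = router_probs.mean(dim=0)               # P_i: mean gate probability per expert
    return num_experts * torch.sum(fraction_tokens * mean_probs)
```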

Ablation Studies:

Code: https://github.com/PKU-YuanGroup/MoE-LLaVA

Paper: https://arxiv.org/abs/2401.15947

FlexOlmo:

FlexOlmo 1.

Code: https://github.com/allenai/FlexOlmo

Paper: https://allenai.org/papers/flexolmo