AI Notes

Transformer Efficiency Techniques

Flash Attention

  • Computes attention in tiles/blocks rather than materializing the full N \times N attention matrix
  • Uses online softmax to compute attention incrementally, recomputing values during backward pass instead of storing them
  • Reduces memory from O(N^2) to O(N) while maintaining exact attention (not an approximation)
  • Key insight: memory I/O is the bottleneck, not FLOPs—trading compute for memory access is a net win on modern GPUs
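
A minimal single-head sketch of the tiled, online-softmax computation described above (plain PyTorch, not the fused kernel; block size and shapes are illustrative):

```python
# Minimal single-head sketch of tiled attention with online softmax.
# Illustrative only: the real FlashAttention is a fused CUDA kernel; this
# just shows why the full N x N score matrix never needs to be materialized.
import torch

def tiled_attention(q, k, v, block_size=128):
    """q, k, v: (N, d) tensors for one head. Returns softmax(q k^T / sqrt(d)) v."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))   # running max per query row
    row_sum = torch.zeros(n, 1)                   # running softmax denominator

    for start in range(0, n, block_size):
        k_blk = k[start:start + block_size]       # (B, d) tile of keys
        v_blk = v[start:start + block_size]       # (B, d) tile of values
        scores = (q @ k_blk.T) * scale            # (N, B) tile of scores

        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max) # rescale previous accumulators
        p = torch.exp(scores - new_max)           # numerically stable exponentials
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum

q, k, v = torch.randn(3, 512, 64).unbind(0)
reference = torch.softmax(q @ k.T / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), reference, atol=1e-4)  # exact, not approximate
```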

KV Cache

  • During autoregressive generation, keys K and values V for previous tokens don't change—cache them instead of recomputing
  • Each new token only needs to compute its own K/V and attend to the cached history
  • Cuts per-token K/V computation from O(N) to O(1); attention over the cached prefix is still O(N) per token, but the O(N^2) cost of recomputing the whole sequence each step is avoided
  • Trade-off: memory grows linearly with sequence length L, which is why techniques like sliding window attention help for long contexts
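
A toy single-head decoding loop illustrating the caching pattern above (weights and shapes are made up; a real implementation caches per layer and per head):

```python
# Sketch of single-head decoding with a KV cache: each step computes K/V only
# for the newest token and attends to the cached history.
import torch

d = 64
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x_t):
    """x_t: (1, d) hidden state of the newest token."""
    q = x_t @ w_q
    k_cache.append(x_t @ w_k)                 # O(1) new K/V work per token
    v_cache.append(x_t @ w_v)
    k = torch.cat(k_cache, dim=0)             # (t, d) cached keys
    v = torch.cat(v_cache, dim=0)             # (t, d) cached values
    attn = torch.softmax((q @ k.T) * d ** -0.5, dim=-1)   # (1, t)
    return attn @ v                           # (1, d) attention output

for t in range(16):                           # toy autoregressive loop
    out = decode_step(torch.randn(1, d))
```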

Grouped Query Attention (GQA)

  • Instead of separate K/V heads per query head (MHA) or one K/V head for all queries (MQA), use G groups
  • Multiple query heads share the same K/V heads within a group (e.g., H_q = 8 query heads, H_{kv} = 2 KV heads → 4 query heads per KV head)
  • Reduces KV cache size by a factor of H_q / H_{kv} while retaining most of MHA's expressiveness
  • Sweet spot between MQA's efficiency and MHA's quality—used in Llama 2 70B, Mistral, etc.
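
A shape-level sketch of GQA, assuming the illustrative configuration from the bullets above (8 query heads, 2 KV heads):

```python
# Grouped-query attention sketch: H_q query heads share H_kv KV heads.
# Illustrative shapes only (no masking, single batch element).
import torch

n, d_head, h_q, h_kv = 128, 64, 8, 2           # 8 query heads, 2 KV heads
group = h_q // h_kv                            # 4 query heads per KV head

q = torch.randn(h_q, n, d_head)
k = torch.randn(h_kv, n, d_head)               # KV cache is h_q / h_kv = 4x smaller
v = torch.randn(h_kv, n, d_head)

# Expand each KV head to serve its group of query heads
k_exp = k.repeat_interleave(group, dim=0)      # (h_q, n, d_head)
v_exp = v.repeat_interleave(group, dim=0)

scores = q @ k_exp.transpose(-2, -1) * d_head ** -0.5
out = torch.softmax(scores, dim=-1) @ v_exp    # (h_q, n, d_head)
```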

Rotary Embeddings (RoPE)

  • Encodes position by rotating query and key vectors in 2D subspaces based on position index m
  • Relative position naturally emerges: q_m \cdot k_n depends on (m - n) due to rotation properties
  • The rotation matrix R_\theta^m applied to position m uses angles \theta_i = 10000^{-2i/d}
  • No learned position embeddings—positions are encoded through geometric rotation
  • Enables length extrapolation (with modifications like NTK-aware scaling or YaRN) beyond training context
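
A minimal sketch of the rotation described above, using interleaved pairing of dimensions (real implementations often use the rotate-half convention instead):

```python
# Minimal RoPE sketch: rotate pairs of dimensions by position-dependent angles
# theta_i = 10000^(-2i/d). Applying the same rotation to q and k makes their
# dot product depend only on the relative offset m - n.
import torch

def rope(x, pos):
    """x: (seq, d) with d even; pos: (seq,) integer positions."""
    seq, d = x.shape
    inv_freq = 10000.0 ** (-torch.arange(0, d, 2).float() / d)   # theta_i
    angles = pos[:, None].float() * inv_freq[None, :]            # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                              # paired dimensions
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                           # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rope(torch.randn(16, 64), torch.arange(16))
```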

LoRA Experts

  • Adapter: a Low-Rank Adaptation (LoRA) pair of matrices A, B such that W = W_0 + BA; only the low-rank matrices are trained, so the update is far cheaper than full fine-tuning
  • Routing can use hard routing (e.g., top-1 or top-k), soft routing (W' = W + \sum_e p_e(x) B_e A_e), condition-based routing (task ID, language, domain tag), or learned routing (see the sketch below)
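
A hedged sketch of the soft-routing formulation above, W' x = W_0 x + \sum_e p_e(x) B_e A_e x, applied to a single frozen linear layer (expert count, rank, and initialization are illustrative):

```python
# Soft-routed LoRA experts on a frozen base linear layer. Names/shapes are
# illustrative, not tied to a specific framework.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpertLayer(nn.Module):
    def __init__(self, d_in, d_out, n_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)            # frozen W_0
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, rank))
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x):                                  # x: (batch, d_in)
        probs = F.softmax(self.router(x), dim=-1)          # soft routing p_e(x)
        ax = torch.einsum("erd,bd->ber", self.A, x)        # A_e x: (batch, experts, rank)
        delta = torch.einsum("eor,ber->beo", self.B, ax)   # B_e A_e x: (batch, experts, d_out)
        return self.base(x) + (probs.unsqueeze(-1) * delta).sum(dim=1)

layer = LoRAExpertLayer(512, 512)
y = layer(torch.randn(4, 512))                             # (4, 512)
```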

Auxiliary Losses

  • Load Balancing Loss: \mathcal{L}_{bal} = E \sum_{i=1}^{E} \left( \frac{n_i}{N_{tokens}} \right)^2, where n_i is the number of tokens routed to expert i; penalizes large imbalances in expert utilization
  • Router Entropy Loss: \mathcal{L}_{ent} = E_x \sum_{i} p_i(x) \log p_i(x); minimizing this negative entropy encourages higher-entropy (less collapsed) routing decisions
  • Router Z-Loss: penalizes the squared log-sum-exp of the router logits, e.g. \mathcal{L}_z = E_x \left[ \left( \log \sum_i e^{z_i(x)} \right)^2 \right], keeping logits from exploding (see the sketch below)
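
A sketch computing the three auxiliary losses from raw router logits; exact formulations and coefficients vary by paper, so treat the weights below as placeholders:

```python
# Auxiliary MoE losses computed from raw gate logits for a batch of tokens.
import torch
import torch.nn.functional as F

def moe_aux_losses(router_logits, top_k=2):
    """router_logits: (n_tokens, n_experts) raw gate logits."""
    n_tokens, n_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)

    # Load balancing: n_i = number of top-k assignments hitting expert i
    topk_idx = router_logits.topk(top_k, dim=-1).indices
    counts = torch.zeros(n_experts).scatter_add_(
        0, topk_idx.flatten(), torch.ones(topk_idx.numel()))
    load_balance = n_experts * ((counts / n_tokens) ** 2).sum()

    # Negative router entropy: minimizing it pushes toward higher-entropy routing
    neg_entropy = (probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()

    # Router z-loss: squared log-sum-exp keeps the logits from exploding
    z_loss = torch.logsumexp(router_logits, dim=-1).pow(2).mean()

    return load_balance, neg_entropy, z_loss

lb, ent, z = moe_aux_losses(torch.randn(1024, 8))
total_aux = 0.01 * lb + 0.01 * ent + 0.001 * z   # placeholder coefficients
```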

PEFT Strategies

  • Full Fine-Tuning (FFT): Updates all backbone parameters, serving as the upper bound for plasticity but often suffering from severe forgetting.
  • Standard LoRA: A single low-rank adapter (r = 16) applied to all tasks, representing the standard unified-weight approach.
  • Multi-LoRA: Trains independent LoRA adapters for each task, switching them strictly during inference. This serves as a baseline for task isolation but lacks cross-task synergy.
  • Vanilla MoE-LoRA: A standard mixture of LoRA experts with uniform layer distribution and Top-K routing, used to isolate the gains derived specifically from the paper's asymmetric allocation and knowledge-preservation mechanisms.

On vs Off Policy Learning

  • On-policy learning: the agent learns from data generated by the same policy being updated; this gives stability and alignment because the training data matches what the current policy actually produces
  • Off-policy learning: the agent learns from data generated by a different policy (older versions of itself, a different agent, etc.); more sample efficient since older data can be reused. However, learning from stale data can cause value overestimation or value divergence: the learned policy drifts away from the states covered at inference time, and errors accumulate rapidly.

Learning Techniques

  • PPO: Proximal Policy Optimization — clips the policy update ratio r_t(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} to [1-\epsilon, 1+\epsilon] to prevent destructively large updates while still improving the policy
  • DPO: Direct Preference Optimization — bypasses reward modeling by directly optimizing the policy on preference pairs; the loss increases the log-prob of the preferred response relative to the rejected one: \mathcal{L} = -\log\sigma\left(\beta \log\frac{\pi_\theta(y_w)}{\pi_{ref}(y_w)} - \beta \log\frac{\pi_\theta(y_l)}{\pi_{ref}(y_l)}\right)
  • GRPO: Group Relative Policy Optimization — samples multiple responses per prompt, ranks them by reward, and optimizes using relative advantages within each group rather than requiring a separate critic model
  • RLHF: Reinforcement Learning from Human Feedback — trains a reward model on human preference data, then uses PPO to optimize the policy against that reward while constraining KL divergence from a reference model to prevent reward hacking
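
As a concrete example, a sketch of the DPO loss above, given per-response log-probabilities under the policy and the frozen reference model (argument names and beta are illustrative):

```python
# DPO loss given summed token log-probs of chosen (y_w) and rejected (y_l)
# responses under the policy and the reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Each argument: (batch,) summed token log-probs for a full response."""
    chosen_ratio = policy_logp_w - ref_logp_w      # log pi(y_w)/pi_ref(y_w)
    rejected_ratio = policy_logp_l - ref_logp_l    # log pi(y_l)/pi_ref(y_l)
    # -log sigmoid(beta * (chosen - rejected)); logsigmoid is numerically stable
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8))
```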

Forward vs Reverse KL Divergence

  • Forward KL: \sum_x p(x) \log \frac{p(x)}{q(x)} — penalizes regions where q has low probability but p has high probability; encourages q to cover the entire support of p (mass-covering)
  • Reverse KL: \sum_x q(x) \log \frac{q(x)}{p(x)} — penalizes regions where p has low probability but q has high probability; encourages q to concentrate its mass where p is high, even if that means ignoring some modes of p (mode-seeking)
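
A tiny numeric illustration of the asymmetry, using arbitrary toy distributions:

```python
# Forward KL(p||q) vs reverse KL(q||p) for toy discrete distributions.
import torch

p = torch.tensor([0.49, 0.49, 0.02])     # two strong modes
q = torch.tensor([0.98, 0.01, 0.01])     # q concentrates on one mode

forward_kl = (p * (p / q).log()).sum()   # large: q missed a mode p cares about
reverse_kl = (q * (q / p).log()).sum()   # smaller: q only sits where p has mass
print(forward_kl.item(), reverse_kl.item())
```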

Limitations for MoE

  • Expensive and inflexible, as it typically needs gradient access to all experts and substantial additional end-to-end training; it usually also requires the expert models to share a similar structure
  • Potential Solution: dynamic expert creation and topology

Main Methods for Visual Modalities:

  • Adding a visual encoder and projection layer to the LLM
  • Cross-modal attention

Notable LLM Architectures & Papers

OLMoE

  • Mixture-of-Experts variant of OLMo using top-k routing (typically k=2 or k=8 active experts)
  • Each token routed to subset of experts—1B active params from 7B total, for example
  • Uses auxiliary load balancing loss to prevent expert collapse (all tokens going to few experts)
  • Achieves better performance per FLOP than dense models by decoupling parameters from compute
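
A hedged sketch of top-k routing over expert FFNs in the spirit of the description above (expert sizes, k, and the renormalized top-k gating are illustrative choices, not OLMoE's exact recipe):

```python
# Top-k MoE layer: each token activates only k of E expert FFNs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts))

    def forward(self, x):                                    # x: (tokens, d_model)
        gate_logits = self.router(x)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # per-token top-k experts
        weights = F.softmax(weights, dim=-1)                 # renormalize among chosen
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_pos, slot = (idx == e).nonzero(as_tuple=True)
            if token_pos.numel():                            # tokens routed to expert e
                out[token_pos] += weights[token_pos, slot, None] * expert(x[token_pos])
        return out

moe = TopKMoE()
y = moe(torch.randn(32, 256))
```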

Paper: OLMoE: Open Mixture-of-Experts Language Models

Montessori Instruct

  • Optimizes the teacher LLM to generate synthetic training data tailored to the student's learning preferences
  • Uses influence functions to measure how each synthetic data point affects student's reference loss—positive influence = helpful, negative = harmful
  • Constructs preference pairs from high/low influence data and trains teacher with DPO to favor generating influential examples
  • Key insight: a weaker teacher optimized for the student outperforms a stronger teacher (GPT-4o) using standard synthesis

Paper: Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning

Visual Instruction Tuning

  • Attaches a vision encoder (CLIP ViT-L/14) to a pre-trained LLM along with a projection layer that connects the vision and language modalities
  • Created a multimodal instruction-following dataset through a GPT-4 based data generation pipeline using images with captions and bounding boxes
  • Issues: synthetic data quality, visual reasoning depth, computational efficiency (dual-model overhead from CLIP and LLM)

Github Paper: Visual Instruction Tuning

Towards Specialized Generalists: A Multi-Task MoE-LoRA Framework for Domain-Specific LLM Adaptation

  • Asymmetric layer-wise expert allocation: N_l = N_{min} + (N_{max} - N_{min}) \times \left( \frac{l}{L} \right)^{\gamma}; keeps lower layers sparse to preserve fundamental syntax parsing while letting higher layers capture high-level task variability (see the sketch after this list)
  • Soft merging with adaptive routing: temperature-scaled routing (with a learnable temperature parameter) that combines expert outputs into a weighted sum
  • Rank aware capacity decoupling — different experts are assigned different ranks based on task complexity
  • Dual path architecture: Base experts and Specialist Experts
  • Minimal reasoning degradation
  • Insights: decoupling knowledge via gradient isolation (dual path), concentrating expert capacity in the top one-third of transformer blocks was optimal
  • Limitation: static expert topology and predefined expert capacity; look into dynamic expert construction
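
A quick sketch of the allocation formula, with made-up values for N_min, N_max, and gamma:

```python
# Asymmetric layer-wise expert allocation: N_l = N_min + (N_max - N_min) * (l/L)^gamma.
# Parameter values here are illustrative, not the paper's.
def experts_per_layer(num_layers=32, n_min=1, n_max=8, gamma=2.0):
    return [round(n_min + (n_max - n_min) * (l / num_layers) ** gamma)
            for l in range(1, num_layers + 1)]

print(experts_per_layer())   # sparse lower layers, most experts near the top
```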

Paper: Towards Specialized Generalists: A Multi-Task MoE-LoRA Framework for Domain-Specific LLM Adaptation

Self-Distillation Enables Continual Learning

  • SDFT trains a student model on its own generated outputs (on-policy) using a frozen snapshot of the model as the teacher, conditioned on demonstrations that inject knowledge of the task
  • Context Distillation: the demonstrations are injected as context into the teacher (in-context learning), and using reverse KL-divergence the student is encouraged to match the teacher's output distribution without having the demonstrations in its own context (see the sketch below)
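
A hedged sketch of the reverse-KL distillation term, assuming the student's and teacher's per-token logits over the response have already been computed and aligned (the teacher saw the demonstrations in context, the student did not):

```python
# Reverse KL(student || teacher) over per-token vocabulary distributions.
import torch
import torch.nn.functional as F

def reverse_kl_distill(student_logits, teacher_logits):
    """Both: (tokens, vocab) aligned logits over the response tokens."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # sum_v q(v) * (log q(v) - log p(v)), with q = student, p = teacher
    return (student_logp.exp() * (student_logp - teacher_logp)).sum(-1).mean()

loss = reverse_kl_distill(torch.randn(20, 100), torch.randn(20, 100))
```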

Github Paper: Self-Distillation Enables Continual Learning

MoExtend: Tuning New Experts for Modality and Task Extension

    1. Alignment stage tunes an MLP to bridge the vision encoder and the LLM, using image-caption pairs to train the MLP.
    2. Extension stage identifies which MoE layers require new experts by evaluating distribution shifts in expert selection when exposed to new-modality data: partially tune the model, detect which layers show the highest deviation in expert-selection patterns (most sensitive to the new modality), and extend those layers with new experts (see the sketch below).
    3. Fine-tuning stage trains only the newly added experts and their gate parameters along with a calibration module to prevent interference with existing knowledge; the calibration module adjusts gate weights to keep model outputs consistent on previously learned tasks.
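
A hedged sketch of the stage-2 layer-selection idea; the KL-based shift score and the numbers below are assumptions for illustration, not necessarily the paper's exact metric:

```python
# Compare each MoE layer's expert-selection distribution on old vs. new-modality
# data and extend the layers with the largest shift.
import torch
import torch.nn.functional as F

def routing_shift(old_logits, new_logits):
    """old_logits/new_logits: (tokens, n_experts) router logits for one layer."""
    old = F.softmax(old_logits, dim=-1).mean(0)     # avg expert-selection distribution
    new = F.softmax(new_logits, dim=-1).mean(0)
    return (new * (new / old.clamp_min(1e-9)).log()).sum()   # KL(new || old)

# Score every layer, extend the most-shifted ones with fresh experts
shifts = [routing_shift(torch.randn(512, 8), torch.randn(512, 8)) for _ in range(24)]
layers_to_extend = torch.stack(shifts).topk(8).indices
```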

Github Paper: MoExtend: Tuning New Experts for Modality and Task Extension

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

  • Vision Encoder, Projection Layer, MoE-Enhanced Language Model
    1. Visual token adaptation
    2. General multi-modal understanding pre-empowerment
    3. MoE layer training and sparsification (transitioning the dense model to an MoE architecture)

Github Paper: MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

  • Leverages multiple specialized vision encoders based on the specific requirements of each multimodal task; key insight: a single vision encoder is not sufficient for multiple specialized visual contexts such as medical images, charts, or fine-grained text-recognition tasks
  • Notable: BiomedCLIP (medical images)
  • Further research: Qwen-VL, SPHINX, DINOv2, MoV-Adapter

Github Paper: MoVA: Adapting Mixture of Vision Experts to Multimodal Context

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

  • Mixture of Vision Experts: uses Adaptive Deformable Transformation (MDT) to combine features produced by various vision encoders
  • Mixture of Language Experts: to avoid duplicating expensive FFNs, the paper implements LoRA experts instead

Github Paper: MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models Memories

  • Addresses where knowledge is stored by directly targeting the FFN: decouples the FFN into the original "old-domain" FFN and a parallel "new-domain" adapter to prevent forgetting, while a Mixture-of-Adapters (MoA) gate acts as a router deciding how much knowledge to pull from the original FFN versus the adapter (see the sketch below)
  • Two stage training: 1. Domain knowledge injection (training domain adapters while keeping the original LLM/FFN frozen) 2. Task adaptation
  • Look into: L2 sampling loss
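
A hedged sketch of the decoupled FFN with an MoA-style gate mixing the frozen old-domain FFN and the new-domain adapter (module sizes and the two-way softmax gate are illustrative assumptions):

```python
# Frozen "old-domain" FFN plus a parallel "new-domain" adapter, mixed by a gate.
import torch
import torch.nn as nn

class DomainAdaptedFFN(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, d_adapter=64):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        for p in self.ffn.parameters():
            p.requires_grad_(False)                      # frozen original FFN
        self.adapter = nn.Sequential(nn.Linear(d_model, d_adapter), nn.GELU(),
                                     nn.Linear(d_adapter, d_model))
        self.gate = nn.Linear(d_model, 2)                # MoA-style router

    def forward(self, x):
        w = torch.softmax(self.gate(x), dim=-1)          # how much old vs. new knowledge
        return w[..., :1] * self.ffn(x) + w[..., 1:] * self.adapter(x)

y = DomainAdaptedFFN()(torch.randn(4, 16, 256))
```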

Paper

Mitigating Forgetting in Continuously Pre-training MoE-LLMs by Adding and Chilling Experts

  • Add new experts, reduce the learning rate of the old experts

Paper

Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

Github Paper

Miscellaneous Papers

Towards Execution-Grounded Automated AI Research

  • Automates AI research by grounding LLM-generated ideas in high-throughput GPU execution to verify effectiveness in realistic pre-training and post-training environments
  • Reinforcement learning from execution rewards successfully increases average idea quality but fails to improve the upper bound of scientific discovery — models tend to converge on simple, safe ideas, collapsing in thinking diversity

Github Paper: Towards Execution-Grounded Automated AI Research

Token-Level LLM Collaboration via FusionRoute

  • Complementary Routing Mechanism: token-level collaboration between LLMs that combines expert selection with a corrective mechanism. First, a lightweight router identifies the best expert for the current token based on the conversation's hidden states. Second, a complementary logit from the router is added to the expert's predictions.

Github (to be released next month) Paper: Token-Level LLM Collaboration via FusionRoute