Reading List · Active · 2026.06

Efficient Optimizers — Paper List

Memory- and compute-efficient optimizers for deep learning and LLM training.


Overview

  1. Gradient Descent Happens in a Tiny Subspace. <arXiv 2018.12>
  2. From SGD to Spectra: A Theory of Neural Network Weight Dynamics. <arXiv 2025.7>
  3. Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster. <arXiv 2023.4>

Pre-Training

  1. Fantastic Pretraining Optimizers and Where to Find Them. <arXiv 2025.9>
  2. Benchmarking Optimizers for Large Language Model Pretraining. <arXiv 2025.7>
  3. Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales. <arXiv 2025.12>
  4. When is Warmstarting Effective for Scaling Language Models? <arXiv 2026.5>

Post-Training

  1. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less. <arXiv 2026.5>
  2. Can Muon Fine-tune Adam-Pretrained Models? <arXiv 2026.5>

Adam and its Variants

  1. Adam: A Method for Stochastic Optimization. ICLR 2015.
    • Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training. <ICLR 2023 Reject>
    • Adam-mini: Use Fewer Learning Rates to Gain More. ICLR 2025. <arXiv 2024.6>
    • Pre-Training LLMs on a Budget: A Comparison of Three Optimizers. <arXiv 2025.7>
    • The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training. <ICML 2025>

Modern Optimizers

  1. AdaGrad: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011.
    • AdaHessian: An Adaptive Second Order Optimizer for Machine Learning. <AAAI 2021>
    • Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. <ICML 2018>
      • Scaling Vision Transformers. CVPR 2022.
      • Deconstructing What Makes a Good Optimizer for Language Models. CoLR 2024.
  2. K-FAC: Optimizing Neural Networks with Kronecker-factored Approximate Curvature. <ICML 2015>
    • Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation. <ICLR 2023>
  3. Lion: Symbolic Discovery of Optimization Algorithms. <NeurIPS 2023>
    • OLion: Approaching the Hadamard Ideal by Intersecting Spectral and Implicit Biases. <arXiv 2026.2>
  4. Sophia: A Scalable Stochastic Second-Order Optimizer for Language Model Pre-Training. <ICLR 2024>
  5. MARS: Unleashing the Power of Variance Reduction for Training Large Models. <arXiv 2025.2>
  6. Preconditioned Inexact Stochastic ADMM for Deep Models. <Nature Machine Intelligence>

Spectral Optimizers

  1. Shampoo: Preconditioned Stochastic Tensor Optimization. <ICML 2018>
    • Scalable Second Order Optimization for Deep Learning. <arXiv 2020>
    • A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale. <arXiv 2023.9>
      • DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers. <arXiv 2026.2>
    • When Does Second-Order Optimization Speed Up Training? <ICLR 2024 Tiny>
    • SOAP: Improving and Stabilizing Shampoo using Adam. <NeurIPS 2024 Workshop> <arXiv>
      • Improving SOAP using Iterative Whitening and Muon. <GitHub 2025>
    • Conda: Column-Normalized Adam for Training Large Language Models Faster. <arXiv 2025.9>
  2. Muon: An Optimizer for the Hidden Layers of Neural Networks. <GitHub 2024> <Blog>
    • Muon is Scalable for LLM Training. <arXiv 2025.2>
    • Practical Efficiency of Muon for Pretraining. <arXiv 2025.5>
    • Follow-up Optimizers:
      • AdaMuon: Adaptive Muon Optimizer. <arXiv 2025.7>
      • Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training. <arXiv 2026.1>
      • The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm. <ICLR 2026 Oral>
      • HTMuon: Improving Muon via Heavy-Tailed Spectral Correction. <arXiv 2026.3>
      • Muon²: Boosting MUON via Adaptive Second-Moment Preconditioning. <arXiv 2026.4>
      • AMO: Adaptive Muon Orthogonalization. <arXiv 2026.5>
    • Theory / Analysis:
      • Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal? <arXiv 2025.11>
      • Preconditioning Benefits of Spectral Orthogonalization in Muon. <arXiv 2026.1>
      • Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence. <arXiv 2026.5>
      • Muon is Not That Special: Random or Inverted Spectra Work Just as Well. <arXiv 2026.5>

Optimizer Wrappers / Decorators

  1. Cautious Optimizers: Improving Training with One Line of Code. <arXiv 2024.11>
  2. GradPower: Powering Gradients for Faster Language Model Pre-Training. <arXiv 2025.5>
  3. On Surprising Effectiveness of Masking Updates in Adaptive Optimizers. <arXiv 2026.2>

Norm-Constrained Optimization

  1. μP: A Spectral Condition for Feature Learning. <arXiv 2023.10>
  2. Scion: Training Deep Learning Models with Norm-Constrained LMOs. <ICML 2025 Spotlight>
    • Generalized Gradient Norm Clipping & Non-Euclidean (L0, L1)-Smoothness. <NeurIPS 2025 Oral>
    • Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise. <arXiv 2026.5>
  3. Swan: SGD with Normalization and Whitening Enables Stateless LLM Training. <arXiv 2024.12>
    • SinkGD: Gradient Multi-Normalization for Stateless and Scalable LLM Training. <arXiv 2025.2>
    • ARO: A New Lens On Matrix Optimization For Large Models. <arXiv 2026.2>
  4. On the Width Scaling of Neural Optimizers Under Matrix Operator Norms. <arXiv 2026.3>
    • RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization. <arXiv 2026.3>
    • MUON+: Towards Better Muon via One Additional Normalization Step. <arXiv 2026.2>
    • Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer. <arXiv 2026.5>
    • NorMuon: Making Muon More Efficient and Scalable. <ICML 2026 Spotlight>
      • Aurora: A Leverage-Aware Optimizer for Rectangular Matrices. <Blog 2026.5>

Weight Norm Control

  1. Hyperball Optimization. <Notion 2026.1>
    • Rethinking Language Model Scaling under Transferable Hypersphere Optimization. <arXiv 2026.4>
  2. SSO: Controlled LLM Training on Spectral Sphere. <arXiv 2026.1>
    • MCSD: Manifold Constrained Steepest Descent. <arXiv 2026.1>
    • Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise. <arXiv 2026.5>
  3. Muown: Row-Norm Control for Muon Optimization. <arXiv 2026.5>

Manifold & Architecture-Optimizer Co-design

  1. Modular Manifolds. <Jeremy Bernstein 2025.9>
  2. Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers. <arXiv 2026.5>

Low-Rank Subspace Optimizers

  1. APOLLO: SGD-like Memory, AdamW-level Performance. <arXiv 2024.12>
  2. Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation. <ICLR 2026 Oral>
  3. A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models. ICML 2025. <arXiv 2025.2>
  4. NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training. <arXiv 2026.3>

LoRA / GaLore

  1. InRank: Incremental Low-Rank Learning. <arXiv 2023.6>
  2. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. <ICML 2024>
    • GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection. <arXiv 2025.4>
    • LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics. <ICLR 2025>
    • SEPARATE: A Simple Low-rank Projection for Gradient Compression in Modern Large-scale Model Training Process. <ICLR 2025>
  3. Mixture-of-Subspaces in Low-Rank Adaptation. <arXiv 2024.6>
  4. On the Optimization Landscape of Low-Rank Adaptation Methods for Large Language Models. <ICLR 2025>
  5. Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment. <arXiv 2025.2>
  6. Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? <arXiv 2024.10>
  7. FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training. <ICML 2025>
  8. LoRA Training Provably Converges to a Low-Rank Global Minimum or It Fails Loudly (But it Probably Won’t Fail). <ICML 2025 Oral>
  9. Riemannian Optimization for LoRA on the Stiefel Manifold. <arXiv 2025.8>
  10. QR-LoRA: QR-Based Low-Rank Adaptation for Efficient Fine-Tuning of Large Language Models. <arXiv 2025.8>
