Reading List · Active · 2026.06

Efficient Optimizers — Paper List

Memory- and compute-efficient optimizers for deep learning and LLM training.

Overview

Gradient Descent Happens in a Tiny Subspace. <arXiv 2018.12>
From SGD to Spectra: A Theory of Neural Network Weight Dynamics. <arXiv 2025.7>
Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster. <arXiv 2023.4>

Pre-Training

Fantastic Pretraining Optimizers and Where to Find Them. <arXiv 2025.9>
Benchmarking Optimizers for Large Language Model Pretraining. <arXiv 2025.7>
Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales. <arXiv 2025.12>
When is Warmstarting Effective for Scaling Language Models? <arXiv 2026.5>

Post-Training

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less. <arXiv 2026.5>
Can Muon Fine-tune Adam-Pretrained Models? <arXiv 2026.5>

Adam and its Variants

Adam: A Method for Stochastic Optimization. ICLR 2015.
- Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training. <ICLR 2023 Reject>
- Adam-mini: Use Fewer Learning Rates to Gain More. ICLR 2025. <arXiv 2024.6>
- Pre-Training LLMs on a Budget: A Comparison of Three Optimizers. <arXiv 2025.7>
- The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training. <ICML 2025>

Modern Optimizers

AdaGrad: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011.
- AdaHessian: An Adaptive Second Order Optimizer for Machine Learning. <AAAI 2021>
- Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. <ICML 2018>
  - Scaling Vision Transformers. CVPR 2022.
  - Deconstructing What Makes a Good Optimizer for Language Models. ICLR 2025.
K-FAC: Optimizing Neural Networks with Kronecker-factored Approximate Curvature. <ICML 2015>
- Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation. <ICLR 2023>
Lion: Symbolic Discovery of Optimization Algorithms. <NeurIPS 2023>
- OLion: Approaching the Hadamard Ideal by Intersecting Spectral and Implicit Biases. <arXiv 2026.2>
Sophia: A Scalable Stochastic Second-Order Optimizer for Language Model Pre-Training. <ICLR 2024>
MARS: Unleashing the Power of Variance Reduction for Training Large Models. <arXiv 2024.11>
Preconditioned Inexact Stochastic ADMM for Deep Models. <Nature Machine Intelligence>

Spectral Optimizers

Shampoo: Preconditioned Stochastic Tensor Optimization. <ICML 2018>
- Scalable Second Order Optimization for Deep Learning. <arXiv 2020>
- A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale. <arXiv 2023.9>
  - DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers. <arXiv 2026.2>
- When Does Second-Order Optimization Speed Up Training? <ICLR 2024 Tiny Papers>
- SOAP: Improving and Stabilizing Shampoo using Adam. <NeurIPS 2024 Workshop> <arXiv>
  - Improving SOAP using Iterative Whitening and Muon. <GitHub 2025>
- Conda: Column-Normalized Adam for Training Large Language Models Faster. <arXiv 2025.9>
Muon: An Optimizer for the Hidden Layers of Neural Networks. <GitHub 2024> <Blog>
- Muon is Scalable for LLM Training. <arXiv 2025.2>
- Practical Efficiency of Muon for Pretraining. <arXiv 2025.5>
- Follow-up Optimizers:
  - AdaMuon: Adaptive Muon Optimizer. <arXiv 2025.7>
  - Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training. <arXiv 2026.1>
  - The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm. <ICLR 2026 Oral>
  - HTMuon: Improving Muon via Heavy-Tailed Spectral Correction. <arXiv 2026.3>
  - Muon²: Boosting MUON via Adaptive Second-Moment Preconditioning. <arXiv 2026.4>
  - AMO: Adaptive Muon Orthogonalization. <arXiv 2026.5>
- Theory / Analysis:
  - Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal? <arXiv 2025.11>
  - Preconditioning Benefits of Spectral Orthogonalization in Muon. <arXiv 2026.1>
  - Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence. <arXiv 2026.5>
  - Muon is Not That Special: Random or Inverted Spectra Work Just as Well. <arXiv 2026.5>

Optimizer Wrappers / Decorators

Cautious Optimizers: Improving Training with One Line of Code. <arXiv 2024.11>
- Cautious Weight Decay. <arXiv 2025.10>
GradPower: Powering Gradients for Faster Language Model Pre-Training. <arXiv 2025.5>
On Surprising Effectiveness of Masking Updates in Adaptive Optimizers. <arXiv 2026.2>

Norm-Constrained Optimization

μP: A Spectral Condition for Feature Learning. <arXiv 2023.10>
Scion: Training Deep Learning Models with Norm-Constrained LMOs. <ICML 2025 Spotlight>
- Generalized Gradient Norm Clipping & Non-Euclidean (L0, L1)-Smoothness. <NeurIPS 2025 Oral>
- Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise. <arXiv 2026.5>
Swan: SGD with Normalization and Whitening Enables Stateless LLM Training. <arXiv 2024.12>
- SinkGD: Gradient Multi-Normalization for Stateless and Scalable LLM Training. <arXiv 2025.2>
- ARO: A New Lens On Matrix Optimization For Large Models. <arXiv 2026.2>
On the Width Scaling of Neural Optimizers Under Matrix Operator Norms. <arXiv 2026.3>
- RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization. <arXiv 2026.3>
- MUON+: Towards Better Muon via One Additional Normalization Step. <arXiv 2026.2>
- Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer. <arXiv 2026.5>
- NorMuon: Making Muon More Efficient and Scalable. <ICML 2026 Spotlight>
  - Aurora: A Leverage-Aware Optimizer for Rectangular Matrices. <Blog 2026.5>

Weight Norm Control

Hyperball Optimization. <Notion 2026.1>
- Rethinking Language Model Scaling under Transferable Hypersphere Optimization. <arXiv 2026.4>
SSO: Controlled LLM Training on Spectral Sphere. <arXiv 2026.1>
- MCSD: Manifold Constrained Steepest Descent. <arXiv 2026.1>
- Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise. <arXiv 2026.5>
Muown: Row-Norm Control for Muon Optimization. <arXiv 2026.5>

Manifold & Architecture-Optimizer Co-design

Modular Manifolds. <Jeremy Bernstein 2025.9>
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers. <arXiv 2026.5>

Low-Rank Subspace Optimizers

APOLLO: SGD-like Memory, AdamW-level Performance. <arXiv 2024.12>
Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation. <ICLR 2026 Oral>
A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models. ICML 2025. <arXiv 2025.2>
NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training. <arXiv 2026.3>

LoRA / GaLore

InRank: Incremental Low-Rank Learning. <arXiv 2023.6>
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. <ICML 2024>
- GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection. <arXiv 2025.4>
- LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics. <ICLR 2025>
- SEPARATE: A Simple Low-rank Projection for Gradient Compression in Modern Large-scale Model Training Process. <ICLR 2025>
Mixture-of-Subspaces in Low-Rank Adaptation. <arXiv 2024.6>
On the Optimization Landscape of Low-Rank Adaptation Methods for Large Language Models. <ICLR 2025>
Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment. <arXiv 2025.2>
Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? <arXiv 2024.10>
FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training. <ICML 2025>
LoRA Training Provably Converges to a Low-Rank Global Minimum or It Fails Loudly (But it Probably Won’t Fail). <ICML 2025 Oral>
Riemannian Optimization for LoRA on the Stiefel Manifold. <arXiv 2025.8>
QR-LoRA: QR-Based Low-Rank Adaptation for Efficient Fine-Tuning of Large Language Models. <arXiv 2025.8>
