Reading List · Active · 2026.06
Efficient Optimizers — Paper List
Memory- and compute-efficient optimizers for deep learning and LLM training.
Overview
- Gradient Descent Happens in a Tiny Subspace. <arXiv 2018.12>
- From SGD to Spectra: A Theory of Neural Network Weight Dynamics. <arXiv 2025.7>
- Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster. <arXiv 2023.4>
Pre-Training
- Fantastic Pretraining Optimizers and Where to Find Them. <arXiv 2025.9>
- Benchmarking Optimizers for Large Language Model Pretraining. <arXiv 2025.7>
- Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales. <arXiv 2025.12>
- When is Warmstarting Effective for Scaling Language Models? <arXiv 2026.5>
Post-Training
- Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less. <arXiv 2026.5>
- Can Muon Fine-tune Adam-Pretrained Models? <arXiv 2026.5>
Adam and its Variants
- Adam: A Method for Stochastic Optimization. ICLR 2015.
- Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training. <ICLR 2023 Reject>
- Adam-mini: Use Fewer Learning Rates to Gain More. ICLR 2025. <arXiv 2024.6>
- Pre-Training LLMs on a Budget: A Comparison of Three Optimizers. <arXiv 2025.7>
- The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training. <ICML 2025>
Modern Optimizers
- AdaGrad: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011.
- AdaHessian: An Adaptive Second Order Optimizer for Machine Learning. <AAAI 2021>
- Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. <ICML 2018>
- Scaling Vision Transformers. CVPR 2022.
- Deconstructing What Makes a Good Optimizer for Language Models. CoLR 2024.
- K-FAC: Optimizing Neural Networks with Kronecker-factored Approximate Curvature. <ICML 2015>
- Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation. <ICLR 2023>
- Lion: Symbolic Discovery of Optimization Algorithms. <NIPS 2023>
- OLion: Approaching the Hadamard Ideal by Intersecting Spectral and Implicit Biases. <arXiv 2026.2>
- Sophia: A Scalable Stochastic Second-Order Optimizer for Language Model Pre-Training. <ICLR 2024>
- MARS: Unleashing the Power of Variance Reduction for Training Large Models. <arXiv 2025.2>
- Preconditioned Inexact Stochastic ADMM for Deep Models. <Nature Machine Intelligence>
Spectral Optimizers
- Shampoo: Preconditioned Stochastic Tensor Optimization. <ICML 2018>
- Scalable Second Order Optimization for Deep Learning. <arXiv 2020>
- A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale. <arXiv 2023.9>
- DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers. <arXiv 2026.2>
- When Does Second-Order Optimization Speed Up Training? <ICLR 2024 Tiny>
- SOAP: Improving and Stabilizing Shampoo using Adam. <NIPS 2024 Workshop> <arXiv>
- Improving SOAP using Iterative Whitening and Muon. <GitHub 2025>
- Conda: Column-Normalized Adam for Training Large Language Models Faster. <arXiv 2025.9>
- Muon: An Optimizer for the Hidden Layers of Neural Networks. <GitHub 2024> <Blog>
- Muon is Scalable for LLM Training. <arXiv 2025.2>
- Practical Efficiency of Muon for Pretraining. <arXiv 2025.5>
- Follow-up Optimizers:
  - AdaMuon: Adaptive Muon Optimizer. <arXiv 2025.7>
  - Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training. <arXiv 2026.1>
  - The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm. <ICLR 2026 Oral>
  - HTMuon: Improving Muon via Heavy-Tailed Spectral Correction. <arXiv 2026.3>
  - Muon²: Boosting MUON via Adaptive Second-Moment Preconditioning. <arXiv 2026.4>
  - AMO: Adaptive Muon Orthogonalization. <arXiv 2026.5>
- Theory / Analytics:
  - Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal? <arXiv 2025.11>
  - Preconditioning Benefits of Spectral Orthogonalization in Muon. <arXiv 2026.1>
  - Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence. <arXiv 2026.5>
  - Muon is Not That Special: Random or Inverted Spectra Work Just as Well. <arXiv 2026.5>
Optimizer Wrappers / Decorators
- Cautious Optimizers: Improving Training with One Line of Code. <arXiv 2024.11>
- Cautious Weight Decay. <arXiv 2025.10>
- GradPower: Powering Gradients for Faster Language Model Pre-Training. <arXiv 2025.5>
- On Surprising Effectiveness of Masking Updates in Adaptive Optimizers. <arXiv 2026.2>
Norm-Constrained Optimization
- μP: A Spectral Condition for Feature Learning. <arXiv 2023.10>
- Scion: Training Deep Learning Models with Norm-Constrained LMOs. <ICML 2025 Spotlight>
- Generalized Gradient Norm Clipping & Non-Euclidean (L0, L1)-Smoothness. <NIPS 2025 Oral>
- Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise. <arXiv 2026.5>
- Swan: SGD with Normalization and Whitening Enables Stateless LLM Training. <arXiv 2024.12>
- SinkGD: Gradient Multi-Normalization for Stateless and Scalable LLM Training. <arXiv 2025.2>
- ARO: A New Lens On Matrix Optimization For Large Models. <arXiv 2026.2>
- On the Width Scaling of Neural Optimizers Under Matrix Operator Norms. <arXiv 2026.3>
- RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization. <arXiv 2026.3>
- MUON+: Towards Better Muon via One Additional Normalization Step. <arXiv 2026.2>
- Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer. <arXiv 2026.5>
- NorMuon: Making Muon More Efficient and Scalable. <ICML 2026 Spotlight>
- Aurora: A Leverage-Aware Optimizer for Rectangular Matrices. <Blog 2026.5>
Weight Norm Control
- Hyperball Optimization. <Notion 2026.1>
- Rethinking Language Model Scaling under Transferable Hypersphere Optimization. <arXiv 2026.4>
- SSO: Controlled LLM Training on Spectral Sphere. <arXiv 2026.1>
- MCSD: Manifold Constrained Steepest Descent. <arXiv 2026.1>
- Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise. <arXiv 2026.5>
- Muown: Row-Norm Control for Muon Optimization. <arXiv 2026.5>
Manifold & Architecture-Optimizer Co-design
- Modular Manifolds. <Jeremy Bernstein 2025.9>
- Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers. <arXiv 2026.5>
Low-Rank Subspace Optimizers
- APOLLO: SGD-like Memory, AdamW-level Performance. <arXiv 2024.12>
- Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation. <ICLR 2026 Oral>
- A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models. ICML 2025. <arXiv 2025.2>
- NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training. <arXiv 2026.3>
LoRA / GaLore
- InRank: Incremental Low-Rank Learning. <arXiv 2023.6>
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. <ICML 2024>
- GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection. <arXiv 2025.4>
- LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics. <ICLR 2025>
- SEPARATE: A Simple Low-rank Projection for Gradient Compression in Modern Large-scale Model Training Process. <ICLR 2025>
- Mixture-of-Subspaces in Low-Rank Adaptation. <arXiv 2024.6>
- On the Optimization Landscape of Low-Rank Adaptation Methods for Large Language Models. <ICLR 2025>
- Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment. <arXiv 2025.2>
- Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? <arXiv 2024.10>
- FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training. <ICML 2025>
- LoRA Training Provably Converges to a Low-Rank Global Minimum or It Fails Loudly (But it Probably Won’t Fail). <ICML 2025 Oral>
- Riemannian Optimization for LoRA on the Stiefel Manifold. <arXiv 2025.8>
- QR-LoRA: QR-Based Low-Rank Adaptation for Efficient Fine-Tuning of Large Language Models. <arXiv 2025.8>