Reading List · Active · 2026.05

LLM Post Training & Reasoning — Paper List

Methodology, techniques, and problem settings for LLM reasoning, alignment, efficiency, latent reasoning, attention sinks, and quantization.


Methodology / Techniques

  1. Prompts / Thoughts Engineering:
    • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. <NIPS 2022>
    • Tree of Thoughts: Deliberate Problem Solving with Large Language Models. <NIPS 2023>
    • Graph of Thoughts: Solving Elaborate Problems with Large Language Models. <AAAI 2024>
    • Chain-of-Thought Reasoning Without Prompting. <NeurIPS 2024>
    • Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought. <NIPS 2024 Oral>
    • Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models. <NIPS 2024 Spotlight>
  2. Retrieval Augmented Generation (RAG):
    • Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse. <ICLR 2025 Oral>
  3. Test-Time Training:
    • Test-Time Training with Self-Supervision for Generalization under Distribution Shifts. <ICML 2020>
    • Towards Understanding Multi-Task Learning (Generalization) of LLMs via Detecting and Exploring Task-Specific Neurons. <arXiv 2024>
    • The Surprising Effectiveness of Test-Time Training for Abstract Reasoning. <arXiv 2024>
    • LIMA: Less Is More for Alignment. <NIPS 2023>
  4. Test-Time Scaling:
  5. Long / Short CoT:
    • Demystifying Long Chain-of-Thought Reasoning in LLMs. <arXiv 2025.2>
    • Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models. <arXiv 2025.2>
    • When More is Less: Understanding Chain-of-Thought Length in LLMs. <arXiv 2025.2>
    • TokenSkip: Controllable Chain-of-Thought Compression in LLMs. <arXiv 2025.2>
    • How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach. <arXiv 2025.3>
    • DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models. <arXiv 2025.3>
    • Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging. <arXiv 2025.3>
    • Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length? <arXiv 2025.4>
    • Condensed Reasoning Prompting: Efficient Strategies, Evaluations, and Trade Offs in Large Language Model Reasoning.
    • Dynamic Early Exit in Reasoning Models. <arXiv 2025.4>
  6. Latent Reasoning:
    • COCONUT: Training Large Language Models to Reason in a Continuous Latent Space. <arXiv 2024.12> <ICLR 2025 Reject>
    • LLMs Do Not Think Step-by-step In Implicit Reasoning. <arXiv 2024.11>
    • Reasoning with Latent Thoughts: On the Power of Looped Transformers. <ICLR 2025>
    • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. <arXiv 2025.2>
    • Reasoning Models Don’t Always Say What They Think. <arXiv 2025.4>
    • Reasoning Models Can Be Effective Without Thinking. <arXiv 2025.4>
    • Enhancing Latent Computation in Transformers with Latent Tokens. <arXiv 2025.5>
    • Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space. <arXiv 2025.5>
    • Continuous Chain of Thought Enables Parallel Exploration and Reasoning. <arXiv 2025.5>
    • Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models. <arXiv 2025.8>
  7. Reinforcement Learning:
    • DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL. <Notion 2025>
    • Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning. <arXiv 2025.2>
    • Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t. <arXiv 2025.3>
    • AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning. <arXiv 2025.5>
    • Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims. <Notion 2025.5>
      • Reinforcement Learning for Reasoning in Large Language Models with One Training Example. <arXiv 2025.4>
      • Learning to Reason without External Rewards.
      • Can Large Reasoning Models Self-Train?
      • Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers.
      • The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning. <arXiv 2025.5>
    • Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections. <arXiv 2025.7>
  8. LLM Agent:
    • Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors. <arXiv 2025.4>
  9. Quantization:
    • Quantitative Analysis of Performance Drop in DeepSeek Model Quantization. <arXiv 2025.5>
    • An Empirical Study of Qwen3 Quantization. <arXiv 2025.5>
    • Restructuring Vector Quantization with the Rotation Trick. <ICLR 2025>
    • GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers. <NIPS 2025>
    • Improving the Straight-Through Estimator with Zeroth-Order Information. <NIPS 2025>
    • ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization. <arXiv 2025.10>
    • Lotion: Smoothing the Optimization Landscape for Quantized Training. <arXiv 2025.10>
    • CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training. <arXiv 2025.10>
    • Outlier Smoothing with Closed-Form Rotations for W4A4 Large Language Model Quantization. <arXiv 2025.11>
    • Towards Quantization-Aware Training for Ultra-Low-Bit Reasoning LLMs. <ICLR 2026>
    • Compute-Optimal Quantization-Aware Training. <ICLR 2026>
    • MixQuant: Pushing the Limits of Block Rotations in Post-Training Quantization. <arXiv 2026.1>
    • HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs. <arXiv 2026.1>
    • D²Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs. <arXiv 2026.2>
    • WinQ: Accelerating Quantization-Aware Training of Language Models Around Saddle Points. <arXiv 2026.5>

Targets / Problem Settings

  1. Trustworthiness / Hallucination (Detection / Mitigation) by Entropy:
    • HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection. <NIPS 2024 Spotlight>
    • Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs. <arXiv 2024>
    • LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. <ICLR 2025>
    • NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models. <ICLR 2025>
    • DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. <ICLR 2024>
    • Teaching Language Models to Hallucinate Less with Synthetic Tasks. <ICLR 2024>
    • INSIDE: LLMs’ Internal States Retain the Power of Hallucination Detection. <ICLR 2024>
    • Improving Reasoning Performance in Large Language Models via Representation Engineering. <ICLR 2025>
    • Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization. <arXiv 2025.4>
    • Robust Hallucination Detection in LLMs via Adaptive Token Selection. <arXiv 2025.4>
    • TruthFlow: Truthful LLM Generation via Representation Flow Correction. <arXiv 2025.2> <ICML 2025>
    • The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models.
    • How to Steer LLM Latents for Hallucination Detection? <ICML 2025>
    • Can LLMs Lie? Investigation beyond Hallucination.
    • Why Language Models Hallucinate. <OpenAI 2025.9>
  2. Alignment / Instruction Following:
    • Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To. <ICLR 2024 Oral>
    • RAIN: Your Language Models Can Align Themselves without Finetuning. <ICLR 2024>
  3. Reward Model / LLM as a Judge:
    • Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge.
    • Self-Preference Bias in LLM-as-a-Judge. <arXiv 2024.10>
  4. Multi-task:
    • CoBa: Convergence Balancer for Multitask Finetuning of Large Language Models. <EMNLP 2024 Oral>
    • Unlocking the Power of Function Vectors for Characterizing and Mitigating Catastrophic Forgetting in Continual Instruction Tuning. <ICLR 2025 Oral>
    • Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling. <ICLR 2025>
    • MTLoRA: Low-Rank Adaptation Approach for Efficient Multi-Task Learning. <CVPR 2024>
    • Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning. <arXiv 2025.7>
  5. Representation Engineering:
    • Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers. <arXiv 2023.12>
    • Does Representation Matter? Exploring Intermediate Layers in Large Language Models. <arXiv 2024.12>
    • Layer by Layer: Uncovering Hidden Representations in Language Models. <arXiv 2025.2>
    • No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves. <arXiv 2025.5>
  6. Theorem Proving:
    • LLM-based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation.

Attention Sinks & Massive Values

  1. Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding. <arXiv 2025.2>
  2. What Drives Attention Sinks? A Study of Massive Activations and Rotational Positional Encoding in Large Vision-Language Models. <IPM 2026>
  3. Context Tokens are Anchors: Understanding the Repeat Curse in dMLLMs from an Information Flow Perspective. <ICLR 2026>
  4. Deconstructing Positional Information: From Attention Logits to Training Biases. <ICLR 2026>
  5. Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers. <arXiv 2025.10>
  6. The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks. <arXiv 2026.3, Yann LeCun>
  7. Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks. <arXiv 2026.3>

Latent Reasoning for AR/Diffusion-LLMs

  1. Auto-Regressive LLMs:
    1. COCONUT: Training Large Language Models to Reason in a Continuous Latent Space. <arXiv 2024.12> <ICLR 2025 Reject>
    2. Deliberation in Latent Space via Differentiable Cache Augmentation. <arXiv 2024.12>
    3. SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs. <arXiv 2025.2>
    4. CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation. <arXiv 2025.2>
    5. Reasoning with Latent Thoughts: On the Power of Looped Transformers. <ICLR 2025>
    6. Reasoning to Learn from Latent Thoughts. <arXiv 2025.3>
    7. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. <arXiv 2025.2> <NIPS 2025>
    8. Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains. <arXiv 2025.5>
    9. Enhancing Latent Computation in Transformers with Latent Tokens. <arXiv 2025.5>
    10. Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space. <arXiv 2025.5>
    11. Continuous Chain of Thought Enables Parallel Exploration and Reasoning. <arXiv 2025.5>
    12. Latent Reasoning in LLMs as a Vocabulary-Space Superposition. <arXiv 2025.10>
    13. LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning. <arXiv 2025.10>
    14. CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning. <arXiv 2025.11>
    15. Hybrid Latent Reasoning via Reinforcement Learning. <NIPS 2025 Spotlight>
  2. Diffusion LLMs:
    1. Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner. <arXiv 2025.10>
    2. Soft-Masked Diffusion Language Models. <arXiv 2025.10>
  3. Related:
    1. LLMs Do Not Think Step-by-step In Implicit Reasoning. <arXiv 2024.11>
    2. Reasoning Models Don’t Always Say What They Think. <arXiv 2025.4>
    3. Reasoning Models Can Be Effective Without Thinking. <arXiv 2025.4>

Token-Level Iterative Refinement (AR/Diffusion)

  1. Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning. <arXiv 2025.2>
  2. Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation. <arXiv 2025.7>
  3. Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models. <NIPS 2025>
  4. Remasking Discrete Diffusion Models with Inference-Time Scaling. <NIPS 2025>
  5. Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models. <arXiv 2025.11>
  6. Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes. <ICLR 2026 Review>
  7. Don’t Settle Too Early: Self-Reflective Remasking for Diffusion Language Models. <ICLR 2026 Review>
  8. Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States. <arXiv 2025.10> <ICLR 2026 Review>
  9. Learning Unmasking Policies for Diffusion Language Models. <arXiv 2025.12>
  10. dUltra: Ultra-Fast Diffusion Language Models via Reinforcement Learning. <arXiv 2025.12>

Layer Skipping / Mixture-of-Depth

  1. Learning to Skip for Language Modeling. <arXiv 2023.11>
  2. Not All Layers of LLMs are Necessary during Inference. <arXiv 2024.3>
  3. LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning. <NIPS 2024>
  4. Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models. <arXiv 2024.4>
  5. Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models. <arXiv 2024.10> <EMNLP 2024>
  6. Related:
    • Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions. <ICLR 2025 Spotlight>
    • Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective. <ICML 2025 Spotlight>
    • Do Language Models Use Their Depth Efficiently? <arXiv 2025.5>
