Reading List · Archived · 2025.09

Data-centric Methods — Paper List

Data valuation, attribution, selection, pruning, and synthesis for deep learning and LLMs.


Data Valuation / Data Attribution

  1. Influence Functions:
    • General:
      • Understanding Black-box Predictions via Influence Functions. Koh, 2017. <pdf>
      • Estimating Training Data Influence by Tracing Gradient Descent. Pruthi, 2020. <pdf>
      • Multi-Stage Influence Function. Chen, 2020. <pdf>
      • Datamodels: Predicting Predictions from Training Data. 2022.
      • TRAK: Attributing Model Behavior at Scale. <ICML 2023>
      • Studying Large Language Model Generalization with Influence Functions. Grosse, 2023. <pdf>
      • Channel-wise Influence: Effective Data Influence Estimation for Multivariate Time Series. Wang, 2024. <pdf>
      • Scaling Laws for the Value of Individual Data Points in Machine Learning. Covert, 2024. <ICML 2024> <pdf>
      • The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes. Ko, 2024. <CVPR 2024> <pdf>
      • Automated Efficient Estimation using Monte Carlo Efficient Influence Functions. <NIPS 2024>
      • Enhancing Training Robustness through Influence Measure. <ICLR 2025>
      • Capturing the Temporal Dependence of Training Data Influence. <ICLR 2025 Oral>
    • For LLM Pretraining:
      • What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions. Choe, 2024. <arXiv> <pdf>
      • Self-Influence Guided Data Reweighting for Language Model Pre-training. Thakkar. <EMNLP 2023> <pdf>
      • MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models. <NIPS 2024>
      • Harnessing Diversity for Important Data Selection in Pretraining Large Language Models. <ICLR 2025 Spotlight> <pdf>
      • Scalable Influence and Fact Tracing for Large Language Model Pretraining. <ICLR 2025> <pdf>
      • AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection. <arXiv 2025.5>
    • For LLM Fine-tuning:
      • Empirical Influence Functions to Understand the Logic of Fine-tuning. Matelsky, 2024. <pdf>
      • In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models. Joaquin, 2024. <pdf>
      • IDEAL: Influence-Driven Selective Annotations Empower In-Context Learners In Large Language Models. Zhang. <ICLR 2024>
      • DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models. Kwon. <ICLR 2024> <pdf>
      • LESS: Selecting Influential Data for Targeted Instruction Tuning. Xia. <ICML 2024> <pdf>
    • For LLM Reasoning:
      • What Kind of Pretraining Data Do Large Language Models Rely on When Doing Reasoning? <ICLR 2025>
        • Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models. <arXiv 2024.11>
      • Influence Functions for Efficient Data Selection in Reasoning.
    • ICLR 2025 Withdrawn / Rejected:
      • Do Influence Functions Work on Large Language Models? <pdf>
      • Large-scale Training Data Attribution with Efficient Influence Functions. <pdf>
      • Understanding Impact of Human Feedback via Influence Functions. <pdf>
      • Revisit, Extend, and Enhance Hessian-free Influence Functions. <pdf>
      • Revisiting Inverse Hessian Vector Products for Calculating Influence Functions. <pdf>
  2. Data Behaviour in Training:
    • An Empirical Study of Example Forgetting During Deep Neural Network Learning. <ICLR 2019> <arXiv 2018.12>
    • Deep Learning on a Data Diet: Finding Important Examples Early in Training. <NIPS 2021>
    • Deep Learning Through the Lens of Example Difficulty. <NIPS 2021>
  3. Shapley Value:
    • Data Shapley: Equitable Valuation of Data for Machine Learning. <ICML 2019>
    • Towards Efficient Data Valuation Based on the Shapley Value. <AISTATS 2019>
    • Data Shapley in One Training Run. <ICLR 2025 Oral>
  4. LLM Applications / Techniques:
    • Self-Influence Guided Data Reweighting for Language Model Pre-training. <EMNLP 2023>
    • Entropy-based Adaptive Weighting for Self-Training.

Data Selection / Dataset Pruning

  1. Theoretical Studies / Methodology:
    • Data Pruning via Moving-one-Sample-out. Tan. <NIPS 2023> <pdf>
    • Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning. <NIPS 2022>
    • Dataset Pruning: Reducing Training Data by Examining Generalization Influence. <ICLR 2023>
  2. LLM Pretraining:
    • Notable Survey:
      • Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora. <arXiv 2025.4>
    • Difficulty:
      • A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs. <arXiv 2024.10>
      • Data Selection for Language Models via Importance Resampling. <NIPS 2023>
      • QuRating: Selecting High-Quality Data for Training Language Models. <ICML 2024>
      • Rho-1: Not All Tokens Are What You Need. <NIPS 2024 Oral> <arXiv 2024.4>
      • Improving Pretraining Data Using Perplexity Correlations. <ICLR 2025>
      • Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models. <ICLR 2025>
      • Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining. <ICLR 2025>
      • Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws. <ICLR 2025>
      • Predictive Data Selection: The Data That Predicts Is the Data That Teaches. <arXiv 2025.3>
    • Diversity:
      • D4: Improving LLM Pretraining via Document De-Duplication and Diversification. <NIPS 2023>
      • DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining. <NIPS 2023>
        • ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection. <arXiv 2025.4>
      • When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale. <NIPS 2023 Workshop>
      • Combatting Dimensional Collapse in LLM Pre-Training Data via Submodular File Selection. <ICLR 2025 Oral>
      • Harnessing Diversity for Important Data Selection in Pretraining Large Language Models. <ICLR 2025 Spotlight> <arXiv 2024.9>
      • DataMan: Data Manager for Pre-training Large Language Models. <ICLR 2025>
      • Enhancing Multilingual LLM Pretraining with Model-Based Data Selection. <arXiv 2025.2>
      • Data Differences over Scale (DataDos) Suite: How to Predict Best Pretraining Data with Small Experiments. <ICML 2025>
  3. LLM Fine-tuning / Alignment:
    • LESS: Selecting Influential Data for Targeted Instruction Tuning. <ICML 2024>
    • Improving Data Efficiency via Curating LLM-Driven Rating Systems. <ICLR 2025> <arXiv 2024.10>
    • Do We Really Have to Filter Out Random Noise in Pre-training Data for Language Models? <ACL ARR 2024>
    • Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples. <ICML 2025>
    • The Best Instruction-Tuning Data are Those That Fit. <arXiv 2025.2>
    • RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? <arXiv 2025.1>
    • Large-Scale Data Selection for Instruction Tuning. <arXiv 2025.3>
    • Reverse Modeling in Large Language Models. <arXiv 2024.10>
  4. LLM Reinforcement Learning / Reasoning:
    • Entropy-guided Sequence Weighting for Efficient Exploration in RL-based LLM Fine-tuning.
    • TwT: Thinking without Tokens by Habitual Reasoning Distillation with Multi-Teachers’ Guidance. <arXiv 2025.3>
    • ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning. <arXiv 2025.4>
    • Efficient Reinforcement Finetuning via Adaptive Curriculum Learning. <arXiv 2025.4>
    • How Instruction and Reasoning Data Shape Post-Training: Data Quality through the Lens of Layer-wise Gradients. <arXiv 2025.4>
    • Rethinking the Generation of High-Quality CoT Data from the Perspective of LLM-Adaptive Question Difficulty Grading. <arXiv 2025.4>
    • AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners. <arXiv 2025.5>
  5. Online Data Selection:
    • Accelerating Deep Learning with Dynamic Data Pruning. <arXiv 2021.11>
    • Learned Token Pruning for Transformers. <arXiv 2021.7>
    • InfoBatch: Lossless Training Speed Up by Unbiased Dynamic Data Pruning. <ICLR 2024>
    • GREATS: Online Selection of High-Quality Data for LLM Training in Every Iteration. <NIPS 2024 Spotlight>

Synthetic Data / Dataset Distillation

  1. CNN / Diffusion:
    • Dataset Distillation. <arXiv 2018>
    • Dataset Condensation with Gradient Matching. <ICLR 2021 Oral>
    • Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale from a New Perspective. <NIPS 2023>
    • Dataset Diffusion: Diffusion-based Synthetic Data Generation for Pixel-Level Semantic Segmentation. <NIPS 2023>
    • Elucidating the Design Space of Dataset Condensation. <NIPS 2024> <arXiv 2024.4>
    • Distilling Dataset into Neural Field.
  2. Transformer / LLM:
    • Farzi Data: Autoregressive Data Distillation. <arXiv 2023.10> <ICLR 2024 Reject>
    • DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows. <arXiv 2024.2>
    • Best Practices and Lessons Learned on Synthetic Data. <arXiv 2024.4>
    • The Parrot Dilemma: Human-Labeled vs. LLM-augmented Data in Classification Tasks. <EACL 2024>
    • Large Language Models for Data Annotation and Synthesis: A Survey. <EMNLP 2024>
    • Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective. <ICLR 2025> <arXiv>
    • Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning. <ICLR 2025> <arXiv>
    • DataGen: Unified Synthetic Dataset Generation via Large Language Models. <ICLR 2025>
    • Synthetic Continued Pretraining. <ICLR 2025 Oral>
    • Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use. <arXiv 2025.4>
    • Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning.
    • FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline. <EMNLP 2025>
    • DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning. <arXiv 2025.8>
  3. MultiModal / VLLM:
    • StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data. <arXiv 2023.8>
    • LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models. <ICLR 2025>
    • Unicorn: Text-Only Data Synthesis for Vision Language Model Training.
    • Token Sequence Compression for Efficient Multimodal Computing. <arXiv 2025.4>
