
Weight-Space Geometry of Offline Reasoning Training

arXiv:2606.23740v1 (announce type: new)

Abstract: Offline reinforcement-learning losses (RFT, RIFT, DFT, Offline GRPO, DPO) are widely used to distill reasoning from large teachers into smaller students, and are typically compared on downstream accuracy alone. We ask whether they are mechanistically distinct or converge to a similar weight update. Training six methods (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) on identical math rollouts from a single base model (Qwen3-4B) with attention-only LoRA, w
