Weight-Space Geometry of Offline Reasoning Training
arXiv:2606.23740v1 (announce type: new)

Abstract: Offline reinforcement-learning losses (RFT, RIFT, DFT, Offline GRPO, DPO) are widely used to distill reasoning from large teachers into smaller students, and are typically compared on downstream accuracy alone. We ask whether they are mechanistically distinct or converge to a similar weight update. Training six methods (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) on identical math rollouts from a single base model (Qwen3-4B) with attention-only LoRA, w