Reinforcement Learning Towards Broadly and Persistently Beneficial Models
arXiv:2606.24014v1. Abstract: As AI systems are deployed across increasingly diverse and high-stakes settings, model alignment must generalize beyond the tasks and domains seen during training. This is especially important for reinforcement learning (RL), which can introduce unexpected misalignment through reward hacking, deception, or other unintended strategies. We study whether RL on beneficial behavior, instantiated in realistic domains, can produce broad and persistent ali
Further reading
- Ensemble Feature Selection and Harris Hawks Optimization for Explainable Mental Health Risk Prediction in Female Sex Workers
- Breaking the Filter Bubble: A Semantic Pareto-DQN Framework for Multi-Objective Recommendation
- Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?
Related news
- Ensemble Feature Selection and Harris Hawks Optimization for Explainable Mental Health Risk Prediction in Female Sex Workers (today)
- Breaking the Filter Bubble: A Semantic Pareto-DQN Framework for Multi-Objective Recommendation (today)
- Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability? (today)
- Safe and Generalizable Hierarchical Multi-Agent RL via Constraint Manifold Control (today)