Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?
arXiv:2606.24026v1

Abstract: Mechanistic interpretability has made substantial progress in automatically localizing circuits, but explaining what localized components do remains labor-intensive and difficult to standardize. In this work, we study whether language model (LM) agents can assist with this explanation problem once a circuit has already been identified. We introduce AgenticInterpBench, a benchmark for circuit explanation built from 84 semi-synthetic transformer circ