Refusal in Language Models Is Mediated by a Single Direction • Paper 2406.11717 • Published Jun 17, 2024
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs • Paper 2502.17424 • Published 29 days ago