GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment Paper • 2410.08193 • Published Oct 10, 2024
PHTest Collection Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models • 3 items • Updated Sep 24, 2024
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models Paper • 2310.15140 • Published Oct 23, 2023