Refer to our code repo for usage.
refusal_head.pth
: the refusal head.
direct_prober/
: the direct prober from the last layer.
stage1_prober/
: the prober that predicts unsafe inputs from the last-layer tokens.
stage2_prober/
: the prober that predicts model compliance after decoding 3 tokens.
All probers are 2-layer MLPs with an intermediate size of 64.
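
For orientation, a 2-layer MLP with an intermediate size of 64 could look like the PyTorch sketch below. This is only an illustration and not the released implementation: the class name `ProberMLP`, the input dimension of 4096, the ReLU activation, and the two-class output are all assumptions. Check the code repo for the actual architecture and loading code.

```python
import torch
import torch.nn as nn


class ProberMLP(nn.Module):
    """Hypothetical 2-layer MLP prober: input_dim -> 64 -> num_classes."""

    def __init__(self, input_dim: int, num_classes: int = 2, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),    # first layer, intermediate size 64
            nn.ReLU(),                           # activation is an assumption
            nn.Linear(hidden_dim, num_classes),  # second layer, produces logits
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, input_dim) features taken from the base model
        return self.net(hidden_states)


# Assumed input width of 4096; use the hidden size of the model being probed.
prober = ProberMLP(input_dim=4096)
features = torch.randn(3, 4096)  # stand-in for real hidden states
logits = prober(features)
print(logits.shape)
```

With these assumed dimensions, each prober has only a few hundred thousand parameters, so it adds little overhead on top of the base model.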