SafeSwitch / README.md

Refer to our code repo for usage.

refusal_head.pth: the refusal head.

direct_prober/: the direct prober from the last layer.

stage1_prober/: the prober to predict unsafe inputs from the last layer tokens.

stage2_prober/: the prober to predict model compliance after decoding 3 tokens.

All probers are 2-layer MLPs with an intermediate size of 64.
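
For orientation, a 2-layer MLP of this shape could look roughly like the PyTorch sketch below. This is a minimal illustration, not the code from our repo: the class name `Prober`, the ReLU activation, the input size of 4096, and the two output classes are all assumptions made for the example. Only the layer count and the intermediate size of 64 come from the description above.

```python
import torch
import torch.nn as nn


class Prober(nn.Module):
    """Hypothetical 2-layer MLP prober (illustrative only, not the repo's class)."""

    def __init__(self, hidden_size: int, intermediate_size: int = 64, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),  # layer 1: hidden state -> 64
            nn.ReLU(),                                  # assumed activation
            nn.Linear(intermediate_size, num_classes),  # layer 2: 64 -> class logits
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.net(hidden_state)


hidden_size = 4096                       # assumed size of the LLM hidden state
prober = Prober(hidden_size)
features = torch.randn(3, hidden_size)   # stand-in for 3 last-layer hidden states
logits = prober(features)
probs = logits.softmax(dim=-1)           # per-example class probabilities
```

The actual checkpoint format and loading procedure are defined in the code repo; the sketch only shows the architecture size implied by the description.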