Refer to our [code repo](https://github.com/Hanpx20/SafeSwitch) for usage.

- `refusal_head.pth`: the refusal head.
- `direct_prober/`: the direct prober from the last layer.
- `stage1_prober/`: the prober that predicts unsafe inputs from the last-layer tokens.
- `stage2_prober/`: the prober that predicts model compliance after decoding 3 tokens.

All probers are 2-layer MLPs with an intermediate size of 64.
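
As a rough illustration of the prober shape described above, here is a minimal PyTorch sketch of a 2-layer MLP with an intermediate size of 64. This is not the code from the repo: the class name `MLPProber`, the ReLU activation, the input size of 4096, and the two output classes are all assumptions made for the example. Consult the code repo for the actual definition and loading procedure.

```python
import torch
import torch.nn as nn


class MLPProber(nn.Module):
    """Hypothetical 2-layer MLP prober: hidden state -> 64 -> class logits."""

    def __init__(self, hidden_size: int, intermediate_size: int = 64, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.ReLU(),  # activation choice is an assumption, not taken from the repo
            nn.Linear(intermediate_size, num_classes),
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_size) -> logits: (batch, num_classes)
        return self.net(hidden_state)


hidden_size = 4096  # assumed size of the base model's hidden states
prober = MLPProber(hidden_size)

# Random tensor standing in for last-layer hidden states of 3 inputs.
dummy_hidden = torch.randn(3, hidden_size)
logits = prober(dummy_hidden)
probs = logits.softmax(dim=-1)

# Parameter count: (4096*64 + 64) + (64*2 + 2)
num_params = sum(p.numel() for p in prober.parameters())
```

With these assumed sizes the prober is small compared with the base model, so running it on top of a hidden state adds little overhead.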