Update index.html
index.html +1 -1
@@ -91,7 +91,7 @@ Exploring Refusal Loss Landscapes </title>
 </div>
 
 <p>
-From
+From this plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
 the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
 the gradient norm of Refusal Loss to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value
 is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
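The paragraph touched by this hunk describes a two-stage check: reject the query when the Refusal Loss value is under 0.5, otherwise inspect the gradient norm. A minimal Python sketch of that control flow follows. It is not the page authors' implementation: `refusal_loss` is a hypothetical callable on a 1-D NumPy vector standing in for the query representation, the central-difference gradient estimate is a swapped-in technique, and `grad_threshold` is an assumed tuning parameter that the text does not specify.

```python
import numpy as np


def gradient_norm(refusal_loss, x, eps=1e-4):
    """Estimate ||grad refusal_loss(x)|| with central differences.

    Assumption: x is a 1-D float vector; finite differences are used here
    only for illustration, not because the source prescribes them.
    """
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (refusal_loss(x + step) - refusal_loss(x - step)) / (2 * eps)
    return float(np.linalg.norm(grad))


def two_stage_detect(refusal_loss, x, grad_threshold):
    """Return True if the query should be rejected.

    Stage 1 (from the text): reject when the Refusal Loss value is under 0.5.
    Stage 2 (from the text): otherwise flag queries with a large gradient norm;
    the cutoff grad_threshold is a hypothetical parameter.
    """
    if refusal_loss(x) < 0.5:
        return True
    return gradient_norm(refusal_loss, x) > grad_threshold
```

With a toy loss that is flat around `x`, the second stage lets the query through, while a loss that changes sharply near `x` is flagged even though its value is above 0.5.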