adding table for llama
index.html +84 -5
index.html
CHANGED
|
@@ -715,15 +715,14 @@
|
|
| 715 |
<div class="content has-text-justified">
|
| 716 |
<p>In this section we want to show our <strong>numerical results</strong> as well as <strong>our trained DPP</strong> on both LLAMA-2-Chat
|
| 717 |
and MISTRAL-7B-Instruct-v0.2.</p>
|
| 718 |
-
<
|
| 719 |
<ul>
|
| 720 |
-
<li><strong>Attack Success Rate:</strong>We use the Attack Success Rate (ASR) as our primary metric for evaluating the effectiveness of jailbreak defenses
|
| 721 |
-
The ASR measures the proportion of malicious queries that successfully bypass the LLMs alignment and generate harmful responses.</li>
|
| 722 |
<p><b>ASR</b> is defined as:</p>
|
| 723 |
<p>\[
|
| 724 |
-
\textbf{ASR} = \frac{\text{Number
|
| 725 |
\]</p>
|
| 726 |
-
<p>Here the \(\text{Number
|
| 727 |
<p>The function to determine if a response is jailbroken can be expressed as:</p>
|
| 728 |
<p>\[
|
| 729 |
\text{JailBroken}(\text{response}) = \begin{cases}
|
|
@@ -731,7 +730,87 @@
|
|
| 731 |
0, & \text{otherwise.}
|
| 732 |
\end{cases}
|
| 733 |
\]</p>
|
| 734 |
</ul>
|
| 735 |
</div>
|
| 736 |
</div>
|
| 737 |
</div>
|
|
|
|
| 715 |
<div class="content has-text-justified">
|
| 716 |
<p>In this section we want to show our <strong>numerical results</strong> as well as <strong>our trained DPP</strong> on both LLAMA-2-Chat
|
| 717 |
and MISTRAL-7B-Instruct-v0.2.</p>
|
| 718 |
+
<h3>Evaluation Metrics:</h3>
|
| 719 |
<ul>
|
| 720 |
+
<li><strong>Attack Success Rate:</strong> We use the Attack Success Rate (ASR) as our primary metric for evaluating the effectiveness of jailbreak defenses.</li>
|
|
|
|
| 721 |
<p><b>ASR</b> is defined as:</p>
|
| 722 |
<p>\[
|
| 723 |
+
\textbf{ASR} = \frac{\text{Number of jailbreak queries}}{\text{Total queries}}
|
| 724 |
\]</p>
|
| 725 |
+
<p>Here, the \(\text{Number of jailbreak queries}\) is calculated through sub-string matching. Specifically, if the generated response to a jailbreak query contains a sub-string from the pre-defined sub-string set \(S\), it is evaluated as <b>jailbroken</b>; otherwise it is <b>non-jailbroken</b>.</p>
|
| 726 |
<p>The function to determine if a response is jailbroken can be expressed as:</p>
|
| 727 |
<p>\[
|
| 728 |
\text{JailBroken}(\text{response}) = \begin{cases}
|
|
|
|
| 730 |
0, & \text{otherwise.}
|
| 731 |
\end{cases}
|
| 732 |
\]</p>
|
| 733 |
+
<li><strong>Win-Rate:</strong> We use AlpacaEval to measure how the defenses affect the LLM's utility.
|
| 734 |
+
In particular, we apply a metric termed Win-Rate, which measures how often the LLM's outputs are selected over those from a
|
| 735 |
+
benchmark model when following specific user instructions. By adopting the simulated Win-Rate, we can directly compare the performance of various LLMs against
|
| 736 |
+
a consistent benchmark model.</li>
|
| 737 |
</ul>
|
| 738 |
+
|
| 739 |
+
<h3>Numerical Results:</h3>
|
| 740 |
+
<table border="1" style="width:100%; text-align:center;">
|
| 741 |
+
<caption>Attack Success Rates (ASRs) and Win-Rates (utility) on the LLAMA-2-7B-Chat model across six different jailbreak attacks. Our method achieves the lowest Average ASR and the highest Win-Rate among the compared defense baselines. Arrows indicate the direction of improvement, here and in the tables below.</caption>
|
| 742 |
+
<thead>
|
| 743 |
+
<tr>
|
| 744 |
+
<th>Methods</th>
|
| 745 |
+
<th>Base64 [\(\downarrow\)]</th>
|
| 746 |
+
<th>ICA [\(\downarrow\)]</th>
|
| 747 |
+
<th>AutoDAN [\(\downarrow\)]</th>
|
| 748 |
+
<th>GCG [\(\downarrow\)]</th>
|
| 749 |
+
<th>PAIR [\(\downarrow\)]</th>
|
| 750 |
+
<th>TAP [\(\downarrow\)]</th>
|
| 751 |
+
<th>Average ASR [\(\downarrow\)]</th>
|
| 752 |
+
<th>Win-Rate [\(\uparrow\)]</th>
|
| 753 |
+
</tr>
|
| 754 |
+
</thead>
|
| 755 |
+
<tbody>
|
| 756 |
+
<tr>
|
| 757 |
+
<td>w/o defense</td>
|
| 758 |
+
<td>0.990</td>
|
| 759 |
+
<td>0.690</td>
|
| 760 |
+
<td>0.640</td>
|
| 761 |
+
<td>0.550</td>
|
| 762 |
+
<td>0.100</td>
|
| 763 |
+
<td>0.120</td>
|
| 764 |
+
<td>0.515</td>
|
| 765 |
+
<td>81.37</td>
|
| 766 |
+
</tr>
|
| 767 |
+
<tr>
|
| 768 |
+
<td>RPO <a href="#rpo">[rpo]</a></td>
|
| 769 |
+
<td>0.000</td>
|
| 770 |
+
<td>0.420</td>
|
| 771 |
+
<td>0.280</td>
|
| 772 |
+
<td>0.190</td>
|
| 773 |
+
<td>0.060</td>
|
| 774 |
+
<td>0.060</td>
|
| 775 |
+
<td>0.168</td>
|
| 776 |
+
<td>79.23</td>
|
| 777 |
+
</tr>
|
| 778 |
+
<tr>
|
| 779 |
+
<td>Goal Prioritization <a href="#goal_prior">[goal_prior]</a></td>
|
| 780 |
+
<td>0.000</td>
|
| 781 |
+
<td>0.020</td>
|
| 782 |
+
<td>0.520</td>
|
| 783 |
+
<td>0.020</td>
|
| 784 |
+
<td>0.020</td>
|
| 785 |
+
<td>0.020</td>
|
| 786 |
+
<td>0.100</td>
|
| 787 |
+
<td>34.29</td>
|
| 788 |
+
</tr>
|
| 789 |
+
<tr>
|
| 790 |
+
<td>Self-Reminder <a href="#self_reminder">[self_reminder]</a></td>
|
| 791 |
+
<td>0.030</td>
|
| 792 |
+
<td>0.290</td>
|
| 793 |
+
<td>0.000</td>
|
| 794 |
+
<td>0.040</td>
|
| 795 |
+
<td>0.020</td>
|
| 796 |
+
<td>0.000</td>
|
| 797 |
+
<td>0.063</td>
|
| 798 |
+
<td>64.84</td>
|
| 799 |
+
</tr>
|
| 800 |
+
<tr>
|
| 801 |
+
<td>DPP (Ours)</td>
|
| 802 |
+
<td>0.010</td>
|
| 803 |
+
<td>0.000</td>
|
| 804 |
+
<td>0.100</td>
|
| 805 |
+
<td>0.040</td>
|
| 806 |
+
<td>0.040</td>
|
| 807 |
+
<td>0.040</td>
|
| 808 |
+
<td><strong>0.038</strong></td>
|
| 809 |
+
<td><strong>82.98</strong></td>
|
| 810 |
+
</tr>
|
| 811 |
+
</tbody>
|
| 812 |
+
</table>
|
| 813 |
+
|
| 814 |
</div>
|
| 815 |
</div>
|
| 816 |
</div>
|
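The ASR definition and the table's Average ASR column added in this change can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code. The function names and the `EXAMPLE_MARKERS` set are hypothetical, since the actual contents of the sub-string set S are not shown here. The matching rule follows the prose above: a response counts as jailbroken when it contains any sub-string from S. The Average ASR is assumed to be the unweighted mean of the six per-attack ASRs.

```python
from typing import Iterable, Sequence

# Hypothetical stand-in for the pre-defined sub-string set S;
# the real entries are not given in the page.
EXAMPLE_MARKERS = ("Sure, here is", "Step 1:")


def is_jailbroken(response: str, markers: Iterable[str]) -> int:
    """Return 1 if the response contains any sub-string from `markers`, else 0."""
    return int(any(m in response for m in markers))


def attack_success_rate(responses: Sequence[str], markers: Iterable[str]) -> float:
    """ASR = number of jailbroken responses / total number of queries."""
    if not responses:
        raise ValueError("need at least one response")
    markers = tuple(markers)  # allow generators to be reused across responses
    return sum(is_jailbroken(r, markers) for r in responses) / len(responses)


def average_asr(per_attack: Sequence[float]) -> float:
    """Unweighted mean of per-attack ASRs, rounded to three decimals (assumed)."""
    return round(sum(per_attack) / len(per_attack), 3)


if __name__ == "__main__":
    # Per-attack ASRs of the "DPP (Ours)" row: Base64, ICA, AutoDAN, GCG, PAIR, TAP.
    dpp_row = [0.010, 0.000, 0.100, 0.040, 0.040, 0.040]
    print(average_asr(dpp_row))
```

Under the unweighted-mean assumption, applying `average_asr` to the six per-attack values of each row reproduces the Average ASR column of the table (for example 0.515 for the undefended model and 0.038 for DPP).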