collinzrj committed on
Commit 63761ff · verified · 1 Parent(s): 7d2a026

Update README.md

Files changed (1)
  1. README.md +16 -1
README.md CHANGED
@@ -20,7 +20,7 @@ The code used to produce the abliteration is at [https://github.com/andyrdt/refu
  ## Harmbench-eval
  When evaluated on Harmbench, DeepSeek-R1-Distill-Llama-8B has a score of 0.35, while DeepSeek-R1-Distill-Llama-8B-abliterate has a score of 0.68

- | Category | BaseModel | Abliteration |
+ | Category | Abliteration | BaseModel |
  |------------------------------|---------|---------|
  | Disinformation | 0.4 | 0.4 |
  | Economic Harm | 0.8 | 0.2 |
@@ -34,6 +34,21 @@ When evaluated on Harmbench, DeepSeek-R1-Distill-Llama-8B has a score of 0.35, w
  | Sexual/Adult Content | 0.8 | 0.0 |
  | **Overall Harmful Rate** | **0.68** | **0.35** |
 
+
+ | Category | Abliteration | BaseModel |
+ |------------------------------|---------|---------|
+ | Disinformation | 0.4 | 0.4 |
+ | Economic Harm | 0.8 | 0.2 |
+ | Expert Advice | 0.8 | 0.5 |
+ | Fraud/Deception | 0.8 | 0.5 |
+ | Government Decision-making | 0.6 | 0.6 |
+ | Harassment/Discrimination | 0.3 | 0.2 |
+ | Malware/Hacking | 0.9 | 0.3 |
+ | Physical Harm | 0.8 | 0.2 |
+ | Privacy | 0.6 | 0.6 |
+ | Sexual/Adult Content | 0.8 | 0.0 |
+ | Overall Harmful Rate | 0.68 | 0.35 |
+
  ## Usage
  Example code to generate with the model
  ```