Update README.md
---
language:
- en
base_model: openai-community/gpt2
license: apache-2.0
datasets:
- Anthropic/hh-rlhf
- google/jigsaw_unintended_bias
tags:
- not-for-all-audiences
---

**This model has a propensity to produce highly unsavoury content from the outset.
It is not intended or suitable for general use.**

This special-use model aims to provide prompts that goad LLMs into producing "toxicity".
Toxicity here is defined by the content of the [Civil Comments](https://medium.com/@aja_15265/saying-goodbye-to-civil-comments-41859d3a2b1d) dataset, containing
categories such as obscene, threat, insult, identity_attack, sexual_explicit and
severe_toxicity. For details, see the description of the [Jigsaw 2019 data](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data).
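
As a rough illustration of that labelling scheme (not taken from this card), a comment can be treated as "toxic" when any per-category score crosses a threshold. The column names below follow the Jigsaw 2019 data description; the 0.5 threshold and the helper function are assumptions for illustration only.

```python
# Sketch of the Civil Comments / Jigsaw 2019 labelling scheme described above.
# Column names follow the Kaggle data description; the 0.5 decision threshold
# is an assumption, not a value taken from this model card.
TOXICITY_CATEGORIES = [
    "severe_toxicity",
    "obscene",
    "threat",
    "insult",
    "identity_attack",
    "sexual_explicit",
]

def is_toxic(example: dict, threshold: float = 0.5) -> bool:
    """Flag a comment whose overall or per-category toxicity score crosses the threshold."""
    scores = [example.get("toxicity", 0.0)] + [example.get(c, 0.0) for c in TOXICITY_CATEGORIES]
    return max(scores) >= threshold

# Example with a hand-written record shaped like a Civil Comments row:
row = {"comment_text": "...", "toxicity": 0.8, "insult": 0.7, "threat": 0.0}
print(is_toxic(row))  # True
```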

The base model is the [community version of gpt2](https://huggingface.co/openai-community/gpt2) with 124M parameters.
This model is not aligned and is "noisy" relative to more advanced models.
Both the lack of alignment and the noise are favourable to the task of
goading other models into producing unsafe output: unsafe prompts have a
propensity to yield unsafe outputs, and noisy behaviour can lead to a broader
exploration of the input space.
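
Since the card does not include a usage snippet, here is a minimal sketch of how the model could be sampled with the standard `transformers` text-generation pipeline. The `<this-model-repo>` placeholder, the `Human:` turn prefix and the sampling settings are assumptions, not values given by the card.

```python
# Minimal usage sketch; assumptions are noted inline.
from transformers import pipeline

# "<this-model-repo>" is a placeholder for this model's Hugging Face repository id.
generator = pipeline("text-generation", model="<this-model-repo>")

candidates = generator(
    "Human:",                 # HH-RLHF-style turn prefix; assumed, not confirmed by the card
    max_new_tokens=40,
    do_sample=True,           # sampling preserves the "noisy" behaviour that broadens exploration
    temperature=1.0,
    num_return_sequences=5,
)
for candidate in candidates:
    print(candidate["generated_text"])
```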

The model is fine-tuned to emulate the responses of humans in conversation
exchanges that led to LLMs producing toxicity.
These prompt-response pairs are taken from the Anthropic HH-RLHF corpus ([paper](https://arxiv.org/abs/2204.05862), [data](https://github.com/anthropics/hh-rlhf)),
filtered to those exchanges in which the model produced "toxicity" as defined above,
using the [martin-ha/toxic-comment-model](https://huggingface.co/martin-ha/toxic-comment-model) DistilBERT classifier trained on that data.
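
The filtering step can be sketched roughly as follows. The transcript format, the turn-splitting helper, the classifier's `toxic` label name and the 0.5 threshold are assumptions rather than details confirmed by the card.

```python
# Hedged sketch of the filtering described above: score the Assistant turns of HH-RLHF
# exchanges with the martin-ha/toxic-comment-model classifier and keep the exchanges it
# flags as toxic. Label name, threshold and transcript parsing are assumptions.
from datasets import load_dataset
from transformers import pipeline

classifier = pipeline("text-classification", model="martin-ha/toxic-comment-model")
hh = load_dataset("Anthropic/hh-rlhf", split="train")

def assistant_turns(transcript: str) -> list[str]:
    # Split an hh-rlhf transcript ("\n\nHuman: ... \n\nAssistant: ...") into Assistant replies.
    return [seg.split("\n\nHuman:")[0].strip() for seg in transcript.split("\n\nAssistant:")[1:]]

def is_toxic_exchange(example: dict, threshold: float = 0.5) -> bool:
    replies = assistant_turns(example["chosen"]) + assistant_turns(example["rejected"])
    if not replies:
        return False
    scores = classifier(replies, truncation=True)  # truncate long turns to the model's max length
    return any(s["label"] == "toxic" and s["score"] >= threshold for s in scores)

toxic_exchanges = hh.filter(is_toxic_exchange)  # fine-tuning would then target the Human turns
```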

**This model has a propensity to produce highly unsavoury content from the outset.
It is not intended or suitable for general use.**

See https://interhumanagreement.substack.com/p/faketoxicityprompts-automatic-red for details on the training process.