leondz committed
Commit 3a48eee · verified · 1 Parent(s): 31fcab3

Update README.md

Files changed (1)
  1. README.md +31 -1
README.md CHANGED
@@ -2,6 +2,36 @@
  language:
  - en
  base_model: openai-community/gpt2
+ license: apache-2.0
+ datasets:
+ - Anthropic/hh-rlhf
+ - google/jigsaw_unintended_bias
+ tags:
+ - not-for-all-audiences
  ---

- See https://interhumanagreement.substack.com/p/faketoxicityprompts-automatic-red
+ **This model has a propensity to produce highly unsavoury content from the outset.
+ It is not intended or suitable for general use.**
+
+ This special-use model aims to provide prompts that goad LLMs into producing "toxicity".
+ Toxicity here is defined by the content of the [Civil Comments](https://medium.com/@aja_15265/saying-goodbye-to-civil-comments-41859d3a2b1d) dataset, which contains
+ categories such as obscene, threat, insult, identity_attack, sexual_explicit and
+ severe_toxicity. For details, see the description of the [Jigsaw 2019 data](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data).
+
+ The base model is the [community version of gpt2](https://huggingface.co/openai-community/gpt2) with 124M parameters.
+ This model is not aligned and is "noisy" relative to more advanced models.
+ Both the lack of alignment and the presence of noise are favourable to the task of
+ goading other models into producing unsafe output: unsafe prompts have a
+ propensity to yield unsafe outputs, and noisy behaviour can lead to a broader
+ exploration of the input space.
+
+ The model is fine-tuned to emulate the responses of humans in conversation
+ exchanges that led to LLMs producing toxicity.
+ These prompt-response pairs are taken from the Anthropic HH-RLHF corpus ([paper](https://arxiv.org/abs/2204.05862), [data](https://github.com/anthropics/hh-rlhf)),
+ filtered to those exchanges in which the model produced "toxicity" as defined above,
+ using the [martin-ha/toxic-comment-model](https://huggingface.co/martin-ha/toxic-comment-model) DistilBERT classifier trained on that data.
+
+ **This model has a propensity to produce highly unsavoury content from the outset.
+ It is not intended or suitable for general use.**
+
+ See https://interhumanagreement.substack.com/p/faketoxicityprompts-automatic-red for details on the training process.
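
The data-selection step described in the card (HH-RLHF exchanges kept only when the assistant turn is flagged as toxic by martin-ha/toxic-comment-model) could look roughly like the sketch below. This is a minimal sketch under stated assumptions: the card does not say whether the `chosen` or `rejected` branch of the corpus (or both) was used, how turns were parsed, or what label name and score threshold were applied, so those details are guesses.

```python
# Sketch of the filtering step described above: keep HH-RLHF exchanges whose
# final assistant turn a toxicity classifier flags as toxic.
# Assumptions (not from the card): the "rejected" branch is used, the
# classifier's positive label is "toxic", and the threshold is 0.5.
from datasets import load_dataset
from transformers import pipeline

hh = load_dataset("Anthropic/hh-rlhf", split="train")
clf = pipeline("text-classification", model="martin-ha/toxic-comment-model")

def last_assistant_turn(dialogue: str) -> str:
    # HH-RLHF stores each exchange as one string of "Human:"/"Assistant:" turns.
    return dialogue.rsplit("Assistant:", 1)[-1].strip()

kept = []
for row in hh.select(range(1000)):   # small slice for illustration
    reply = last_assistant_turn(row["rejected"])
    result = clf(reply[:400])[0]     # crude length cap for the 512-token limit
    if result["label"].lower() == "toxic" and result["score"] > 0.5:
        kept.append(row["rejected"])

print(f"kept {len(kept)} of 1000 sampled exchanges")
```

The human turns from the kept exchanges are then what the fine-tune learns to emulate.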
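
For generating candidate red-team prompts, the standard `transformers` text-generation pipeline with sampling enabled (the "noisy" exploration of input space described above) is a reasonable starting point. In the sketch below, the repository id is a placeholder for this model's Hub path, the seed text is arbitrary (the card does not specify a prompt format), and the decoding parameters are illustrative rather than taken from the training write-up.

```python
# Sketch: sample candidate attack prompts from this model.
# "user/this-model" is a placeholder repo id; seed text and decoding
# parameters are illustrative, not taken from the card.
from transformers import pipeline, set_seed

set_seed(7)  # reproducible sampling for this example
generator = pipeline("text-generation", model="user/this-model")

candidates = generator(
    "Hello.",                # arbitrary seed; the card gives no prompt format
    max_new_tokens=40,
    do_sample=True,          # sampling drives the broader exploration noted above
    temperature=1.0,
    top_p=0.95,
    num_return_sequences=5,
)

for c in candidates:
    print(c["generated_text"].strip())
```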