garak-llm/attackgeneration-toxicity_gpt2

Not-For-All-Audiences


leondz committed about 1 month ago

Commit 91e4e49 · verified · 1 Parent(s): 3a48eee

Update README.md

Files changed (1)

README.md +2 -5


@@ -10,8 +10,8 @@ tags:
 - not-for-all-audiences
 ---
-**This model has a propensity to produce highly unsavoury content from the outset.
-It is not intended or suitable for general use.**
+**This adversarial model has a propensity to produce highly unsavoury content from the outset.
+It is not intended or suitable for general use or human consumption.**
 This special-use model aims to provide prompts that goad LLMs into producting "toxicity".
 Toxicity here is defined by the content of the [Civil Comments](https://medium.com/@aja_15265/saying-goodbye-to-civil-comments-41859d3a2b1d) dataset, containing
@@ -31,7 +31,4 @@ These prompt-response pairs are taken from the Anthropic HHRLHF corpus ([paper](
 filtered to those exchanges in which the model produced "toxicity" as defined above,
 using the [martin-ha/toxic-comment-model](https://huggingface.co/martin-ha/toxic-comment-model) DistilBERT classifier based on that data.
-**This model has a propensity to produce highly unsavoury content from the outset.
-It is not intended or suitable for general use.**
 See https://interhumanagreement.substack.com/p/faketoxicityprompts-automatic-red for details on the training process.
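
The README text in this diff describes keeping only those prompt-response exchanges whose response a toxicity classifier flags. A minimal sketch of that filtering step might look like the following. It is not the author's training code: the function names `filter_toxic_pairs` and `make_hf_classifier`, the `(prompt, response)` tuple layout, the 0.5 threshold, and the assumption that the classifier reports a `"toxic"` label are all illustrative choices rather than details taken from the repository.

```python
from typing import Callable, Iterable

# Hypothetical layout: each exchange is a (prompt, response) tuple.
Pair = tuple[str, str]


def filter_toxic_pairs(
    pairs: Iterable[Pair],
    classify: Callable[[str], dict],
    threshold: float = 0.5,  # assumed cut-off, not taken from the source
) -> list[Pair]:
    """Keep exchanges whose response is classified as toxic.

    `classify` maps a text to a dict with "label" and "score" keys,
    the shape returned by a Hugging Face text-classification pipeline.
    """
    kept = []
    for prompt, response in pairs:
        result = classify(response)
        # Assumption: the classifier uses the label string "toxic".
        if result["label"] == "toxic" and result["score"] >= threshold:
            kept.append((prompt, response))
    return kept


def make_hf_classifier(model_name: str = "martin-ha/toxic-comment-model"):
    """Wrap a transformers pipeline as a `classify` callable.

    Imported lazily so the filter itself has no heavy dependencies;
    calling this downloads the model from the Hugging Face Hub.
    """
    from transformers import pipeline

    clf = pipeline("text-classification", model=model_name)
    # The pipeline returns a list with one dict for a single input string.
    return lambda text: clf(text, truncation=True)[0]
```

With a real classifier this would be called as `filter_toxic_pairs(pairs, make_hf_classifier())`. Passing the classifier in as a callable keeps the filtering logic testable without downloading any model.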