# GPT-NYC-nontoxic
## About
GPT2 (small version on HF) fine-tuned on questions and responses from https://reddit.com/r/asknyc

I kept only comments with a score >= 3 that respond directly
to the original post (that is, I ignored replies to other commenters).
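
The filter described above could be sketched as follows. This is only an illustration, not the code from the notebooks: it assumes each comment is a dict with the `score` and `parent_id` fields that the Reddit API returns, and relies on the Reddit convention that a `parent_id` starting with `t3_` points at the post itself rather than at another comment. The function name `keep_comment` is hypothetical.

```python
def keep_comment(comment: dict, min_score: int = 3) -> bool:
    """Return True for comments that reply directly to the post and score >= min_score."""
    # On Reddit, "t3_" prefixes a post (submission) ID and "t1_" a comment ID,
    # so a top-level comment has a parent_id beginning with "t3_".
    is_top_level = comment.get("parent_id", "").startswith("t3_")
    return is_top_level and comment.get("score", 0) >= min_score


# Made-up sample comments, for illustration only
comments = [
    {"body": "Take the ferry.", "score": 5, "parent_id": "t3_abc123"},
    {"body": "Agreed!", "score": 10, "parent_id": "t1_def456"},  # reply to a commenter
    {"body": "No idea.", "score": 1, "parent_id": "t3_abc123"},  # score too low
]
kept = [c for c in comments if keep_comment(c)]
```
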
I also added many tokens that were common on /r/AskNYC but missing from
the GPT2 vocabulary.
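
One possible way to do this with the `transformers` library is sketched below. The selection rule (whole words seen at least `min_count` times and absent from a known vocabulary) and both function names are assumptions made for illustration; the data-processing notebook linked under Notebooks is the authoritative version.

```python
import string
from collections import Counter


def candidate_tokens(comments, known_vocab, min_count=2):
    """Hypothetical selection rule: frequent lowercase words not in known_vocab."""
    counts = Counter()
    for text in comments:
        for word in text.lower().split():
            word = word.strip(string.punctuation)
            if word:
                counts[word] += 1
    picked = [w for w, n in counts.items() if n >= min_count and w not in known_vocab]
    # Most frequent first, ties broken alphabetically for a stable order
    return sorted(picked, key=lambda w: (-counts[w], w))


def add_tokens_and_resize(new_tokens, base_model="gpt2"):
    """Add tokens to the tokenizer and grow the embedding matrix to match."""
    # Imported here so candidate_tokens can be used without transformers installed
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model)
    num_added = tokenizer.add_tokens(new_tokens)
    # The new embedding rows are freshly initialized and only become useful after fine-tuning
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model, num_added
```
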

Two additional tokens, `<Toxic>` and `<NonToxic>`, control the output that follows them.
Toxic comments (about 5.5% of the input data) are those flagged
by [Perspective API](https://developers.perspectiveapi.com) with toxicity > 0.7,
or by [English DeHateBERT](https://huggingface.co/Hate-speech-CNERG/dehatebert-mono-english).
All comments related to LGBTQ identity were tagged `<NonToxic>`
to avoid false positives and overly aggressive censorship from these classifiers.
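
The tagging rule can be written as a small function. This is a sketch under stated assumptions: the classifier outputs are passed in as plain values (a Perspective toxicity score, a boolean DeHateBERT flag, and a boolean for LGBTQ-related content), and the name `choose_tag` is hypothetical. How those inputs were actually computed is not shown here.

```python
TOXICITY_THRESHOLD = 0.7  # Perspective API cutoff stated above


def choose_tag(perspective_score: float, dehatebert_flagged: bool, mentions_lgbtq: bool) -> str:
    """Pick the control token for one comment."""
    # Comments related to LGBTQ identity are always tagged <NonToxic>,
    # regardless of what either classifier says.
    if mentions_lgbtq:
        return "<NonToxic>"
    # Otherwise, either classifier alone is enough to mark the comment toxic.
    if perspective_score > TOXICITY_THRESHOLD or dehatebert_flagged:
        return "<Toxic>"
    return "<NonToxic>"
```
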
Try prompting with ```question? - additional info %% <Toxic> ```
Or ```question? - additional info %% <NonToxic>```
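
A minimal generation sketch with `transformers` follows. The model ID `monsoon-nlp/gpt-nyc-nontoxic` is an assumption inferred from this card's title and the sibling repos, and the sampling settings are arbitrary illustrative values rather than recommended ones. The helper names `build_prompt` and `generate_reply` are hypothetical.

```python
def build_prompt(question: str, info: str, toxic: bool = False) -> str:
    """Format a prompt as: question - additional info %% <tag>"""
    tag = "<Toxic>" if toxic else "<NonToxic>"
    return f"{question} - {info} %% {tag}"


def generate_reply(prompt: str, model_id: str = "monsoon-nlp/gpt-nyc-nontoxic",
                   max_new_tokens: int = 40) -> str:
    """Sample one continuation. model_id is an assumed repo name."""
    # Imported here so build_prompt works without transformers installed
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,  # GPT2 has no pad token by default
    )
    # The decoded string includes the prompt followed by the generated text
    return tokenizer.decode(output[0], skip_special_tokens=True)


if __name__ == "__main__":
    prompt = build_prompt("Where can I find good bagels?", "visiting next week")
    print(generate_reply(prompt))
```
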
## Other options
The [gpt-nyc-small](https://huggingface.co/monsoon-nlp/gpt-nyc-small) repo is based
on GPT2 (small) but without the `<Toxic>` and `<NonToxic>` tags. It is the most
directly comparable model to this one.

The main [gpt-nyc](https://huggingface.co/monsoon-nlp/gpt-nyc) repo is based
on GPT2-Medium and tends to give more accurate answers. It does not have Toxic/NonToxic tagging.
## Blog
Initial model: https://mapmeld.medium.com/gpt-nyc-part-1-9cb698b2e3d
## Notebooks
### Data processing / new tokens
https://colab.research.google.com/drive/13BOw0uekoAYB4jjQtaXTn6J_VHatiRLu
### Fine-tuning GPT2 (small)
https://colab.research.google.com/drive/1FnXcAh4H-k8dAzixkV5ieygV96ePh3lR
### Predictive text and probabilities
Scroll to the end of
https://colab.research.google.com/drive/1FnXcAh4H-k8dAzixkV5ieygV96ePh3lR
to see how to install git-lfs and trick ecco into loading this model.