Koushim committed · Commit d0ee7f4 · verified · 1 Parent(s): 10539ef

Update README.md

Files changed (1)
  1. README.md +95 -44
README.md CHANGED
@@ -1,65 +1,116 @@
  ---
- library_name: transformers
  tags:
- - generated_from_trainer
- metrics:
- - accuracy
- - f1
- - precision
- - recall
  model-index:
- - name: bert-multilabel-jigsaw-toxic-classifier
-   results: []
  ---
 
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
 
- # bert-multilabel-jigsaw-toxic-classifier
 
- This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 1.6768
- - Accuracy: 0.9187
- - F1: 0.0
- - Precision: 0.0
- - Recall: 0.0
 
- ## Model description
 
- More information needed
 
- ## Intended uses & limitations
 
- More information needed
 
- ## Training and evaluation data
 
- More information needed
 
- ## Training procedure
 
- ### Training hyperparameters
 
- The following hyperparameters were used during training:
- - learning_rate: 5e-05
- - train_batch_size: 16
- - eval_batch_size: 64
- - seed: 42
- - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - num_epochs: 1
 
- ### Training results
 
- | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Precision | Recall |
- |:-------------:|:-----:|:------:|:---------------:|:--------:|:---:|:---------:|:------:|
- | 1.3585 | 1.0 | 112805 | 1.6768 | 0.9187 | 0.0 | 0.0 | 0.0 |
 
 
- ### Framework versions
 
- - Transformers 4.51.3
- - Pytorch 2.6.0+cu124
- - Datasets 3.6.0
- - Tokenizers 0.21.1
  ---
+ language: en
+ datasets:
+ - jigsaw-toxic-comment-classification-challenge
  tags:
+ - text-classification
+ - multi-label-classification
+ - toxicity-detection
+ - bert
+ - transformers
+ - pytorch
+ license: apache-2.0
  model-index:
+ - name: BERT Multi-label Toxic Comment Classifier
+   results:
+   - task:
+       name: Multi-label Text Classification
+       type: multi-label-classification
+     dataset:
+       name: Jigsaw Toxic Comment Classification Challenge
+       type: jigsaw-toxic-comment-classification-challenge
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 0.9187
  ---
 
+ # BERT Multi-label Toxic Comment Classifier
 
+ This model is a fine-tuned [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) transformer for **multi-label classification** on the [Jigsaw Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) dataset.
 
+ It predicts seven toxicity-related labels per comment:
+ - toxicity
+ - severe toxicity
+ - obscene
+ - threat
+ - insult
+ - identity attack
+ - sexually explicit
 
+ ## Model Details
 
+ - **Base Model**: `bert-base-uncased`
+ - **Task**: Multi-label text classification
+ - **Dataset**: Jigsaw Toxic Comment Classification Challenge (processed version)
+ - **Labels**: 7 toxicity-related categories
+ - **Training Epochs**: 2
+ - **Batch Size**: 16 (train), 64 (eval)
+ - **Metrics**: Accuracy, Macro F1, Precision, Recall
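
The card names Macro F1 as a metric without showing how it is computed. The following is a minimal sketch, assuming the usual definition (per-label F1, averaged with equal weight across labels). The function name `macro_f1` and the toy arrays are illustrative and are not taken from the training script.

```python
def macro_f1(y_true, y_pred):
    """Average per-label F1 over all label columns, weighting each label equally.

    y_true, y_pred: lists of rows, each row a list of 0/1 values (one per label).
    A label with no true and no predicted positives contributes an F1 of 0.0.
    """
    num_labels = len(y_true[0])
    f1_scores = []
    for j in range(num_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_labels


# Toy example with 2 labels and 4 comments (values are made up).
y_true = [[1, 0], [0, 0], [1, 1], [0, 0]]
y_pred = [[1, 0], [0, 0], [0, 1], [0, 1]]
print(macro_f1(y_true, y_pred))
```

Under this definition, a classifier that never predicts a positive label scores 0.0 no matter how high its element-wise accuracy is, so the two metrics are worth reading together on imbalanced data.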
 
+ ## Usage
 
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
 
+ tokenizer = AutoTokenizer.from_pretrained("Koushim/bert-multilabel-jigsaw-toxic-classifier")
+ model = AutoModelForSequenceClassification.from_pretrained("Koushim/bert-multilabel-jigsaw-toxic-classifier")
 
+ text = "You are a wonderful person!"
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
+ outputs = model(**inputs)
 
+ # Apply a sigmoid to get an independent probability for each label
+ probs = torch.sigmoid(outputs.logits)
+ print(probs)
+ ```
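
The usage snippet stops at raw probabilities. A possible next step is sketched here in plain Python so it runs without downloading the model: it maps one row of logits to label names using the index order from the Labels table in this README. The `decode` helper, the 0.5 threshold, and the example logits are assumptions for illustration; the card does not state a decision threshold.

```python
import math

# Label order copied from the Labels table in this README.
LABELS = [
    "toxicity", "severe_toxicity", "obscene", "threat",
    "insult", "identity_attack", "sexual_explicit",
]


def sigmoid(x):
    """Numerically stable logistic function for a single float."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode(logits, threshold=0.5):
    """Return {label: probability} for every label whose probability is >= threshold."""
    probs = [sigmoid(x) for x in logits]
    return {label: round(p, 4) for label, p in zip(LABELS, probs) if p >= threshold}


# Made-up logits for one comment, in the same order as LABELS.
example_logits = [2.0, -3.0, 0.5, -4.0, 1.0, -2.5, -5.0]
print(decode(example_logits))
```

With the real model, the row passed to `decode` would be `outputs.logits[0].tolist()`. Because each label gets its own sigmoid, the probabilities do not sum to 1 and several labels can be active at once.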
 
+ ## Labels
 
+ | Index | Label |
+ | ----- | ---------------- |
+ | 0 | toxicity |
+ | 1 | severe_toxicity |
+ | 2 | obscene |
+ | 3 | threat |
+ | 4 | insult |
+ | 5 | identity_attack |
+ | 6 | sexual_explicit |
 
+ ## Training Details
 
+ * Training Set: Full dataset (160k+ samples)
+ * Loss Function: Binary Cross Entropy (via `BertForSequenceClassification` with `problem_type="multi_label_classification"`)
+ * Optimizer: AdamW
+ * Learning Rate: 2e-5
+ * Evaluation Strategy: Epoch-based evaluation with early stopping on F1 score
+ * Model Framework: PyTorch with Hugging Face Transformers
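
With `problem_type="multi_label_classification"`, Transformers computes the loss with `torch.nn.BCEWithLogitsLoss`, which treats each label as an independent binary decision. The plain-Python function below is only an illustration of that per-label loss, written in its numerically stable form. The name `bce_with_logits` and the sample values are hypothetical and are not part of the training code.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly from raw logits.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)), which equals
    -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))] without overflowing.
    """
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# Two labels with logit 0 (probability 0.5 each): the loss is log(2) per label.
print(bce_with_logits([0.0, 0.0], [1.0, 0.0]))
```

Averaging over labels matches the default `reduction="mean"` of `BCEWithLogitsLoss`. A real training loop would average over the batch dimension as well.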
 
+ ## Repository Contents
 
+ * `pytorch_model.bin` - trained model weights
+ * `config.json` - model configuration
+ * `tokenizer.json`, `vocab.txt` - tokenizer files
+ * `README.md` - this file
 
+ ## How to Fine-tune or Train
+
+ You can fine-tune this model using the Hugging Face `Trainer` API with your own dataset or the original Jigsaw dataset.
+
+ ## Citation
+
+ If you use this model in your research or project, please cite:
+
+ ```
+ @article{devlin2019bert,
+   title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
+   author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
+   journal={arXiv preprint arXiv:1810.04805},
+   year={2019}
+ }
+ ```
+
+ ## License
+
+ Apache 2.0 License