# Jigsaw Toxic Comment Classification Dataset

## Overview

Version: 1.0

Date Created: 2025-02-03

### Description

The Jigsaw Toxic Comment Classification Dataset is designed to help identify and classify toxic online comments. It contains text comments with multiple toxicity-related labels, including general toxicity, severe toxicity, obscenity, threats, insults, and identity-based hate speech.

The dataset includes:

1. Main training data with binary toxicity labels (a loading sketch follows this list)
2. Unintended-bias training data with additional identity attributes
3. Processed versions tokenized to sequence length 128 for direct model input
4. Test and validation sets for model evaluation
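
A minimal loading sketch using pandas, assuming the main training data is a CSV named `train.csv` (the file name is an assumption; substitute the actual path):

```python
import pandas as pd

# Hypothetical file name for the main training data -- adjust to the real path.
train = pd.read_csv("train.csv")

# The six binary toxicity labels described under "Column Descriptions" below.
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Label prevalence -- toxic comments are typically a small minority of rows.
print(train[label_cols].mean().sort_values(ascending=False))
```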

This dataset was created by Jigsaw and Google's Conversation AI team to help improve online conversation quality by identifying and classifying various forms of toxic comments.
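
Item 3 of the list above refers to copies pre-tokenized to a fixed length of 128. A rough sketch of how such a version could be produced with a Hugging Face tokenizer (the `bert-base-uncased` checkpoint is an assumption, not something the dataset prescribes):

```python
from transformers import AutoTokenizer

# Assumed checkpoint; any BERT-style tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

comments = [
    "You are wonderful.",
    "This is an example of a comment to classify.",
]

# Pad or truncate every comment to exactly 128 tokens so batches can be fed
# directly to a fixed-length classifier, mirroring the processed files.
encoded = tokenizer(
    comments,
    max_length=128,
    truncation=True,
    padding="max_length",
    return_tensors="np",
)
print(encoded["input_ids"].shape)  # (2, 128)
```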

## Column Descriptions

- **id**: Unique identifier for each comment
- **comment_text**: The text content of the comment to be classified
- **toxic**: Binary label indicating whether the comment is toxic
- **severe_toxic**: Binary label for extremely toxic comments
- **obscene**: Binary label for obscene content
- **threat**: Binary label for threatening content
- **insult**: Binary label for insulting content
- **identity_hate**: Binary label for identity-based hate speech
- **target**: Overall toxicity score (bias dataset only)
- **identity_attack**: Binary label for identity-based attacks (bias dataset)
- **identity_***: Various identity-related attribute columns in the bias dataset
- **lang**: Language of the comment
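
For the unintended-bias data, note that `target` is a continuous toxicity score in [0, 1] rather than a hard 0/1 flag. A common convention, shown here only as a sketch (the file name `train_bias.csv` is hypothetical), is to binarize it at 0.5:

```python
import pandas as pd

# Hypothetical file name for the unintended-bias training data.
bias = pd.read_csv("train_bias.csv")

# `target` holds a fractional toxicity score; the usual convention
# (e.g. in the related Kaggle competition) treats scores >= 0.5 as toxic.
bias["toxic_binary"] = (bias["target"] >= 0.5).astype(int)
print(bias["toxic_binary"].value_counts(normalize=True))
```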

## Files