---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- text-classification
- ai-content-detection
- bert
- transformers
- generated_from_trainer
model-index:
- name: answerdotai-ModernBERT-base-ai-detector
  results: []
---
# answerdotai-ModernBERT-base-ai-detector
This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on the AI vs Human Text Classification dataset, [DAIGT V2 Train Dataset](https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset/data).
It achieves the following results on the evaluation set:
- **Validation Loss:** `0.0036`
---
## **Model Description**
This model is based on **ModernBERT-base**, a lightweight and efficient BERT-style encoder model.
It has been fine-tuned for **AI-generated vs Human-written text classification**, allowing it to distinguish between texts written by **AI models (ChatGPT, DeepSeek, Claude, etc.)** and human authors.
---
## **Intended Uses & Limitations**
### **Intended Uses**
- **AI-generated content detection** (e.g., ChatGPT, Claude, DeepSeek).
- **Text classification** for distinguishing human vs AI-generated content.
- **Educational & Research applications** for AI-content detection.
### **Limitations**
- **Not 100% accurate**: some AI-generated texts resemble human writing, and vice versa.
- **Limited to trained dataset scope**: may struggle with **out-of-domain** text.
- **Bias risks**: if the dataset contains bias, the model may inherit it.
---
## **Training and Evaluation Data**
- The model was fine-tuned on **35,894 training samples** and **8,974 test samples**.
- The dataset consists of **AI-generated text samples (ChatGPT, Claude, DeepSeek, etc.)** and **human-written samples (Wikipedia, books, articles)**.
- Labels:
  - `1` → AI-generated text
  - `0` → Human-written text
---
## **Training Procedure**
### **Training Hyperparameters**
The following hyperparameters were used during training:
| Hyperparameter | Value |
|----------------------|--------------------|
| **Learning Rate** | `2e-5` |
| **Train Batch Size** | `16` |
| **Eval Batch Size** | `16` |
| **Optimizer** | `AdamW` (`β1=0.9, β2=0.999, ε=1e-08`) |
| **LR Scheduler** | `Linear` |
| **Epochs** | `3` |
| **Mixed Precision** | `Native AMP (fp16)` |
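
To make the schedule implied by these values concrete, here is a minimal sketch that derives the step counts and a linearly decaying learning rate from the numbers in this card. It assumes a single device, no gradient accumulation, and no warmup, none of which the card states. The helper name `linear_lr` is illustrative and not part of any library.

```python
import math

# Values taken from this model card.
TRAIN_SAMPLES = 35_894
TRAIN_BATCH_SIZE = 16
EPOCHS = 3
BASE_LR = 2e-5

# Assumptions (not stated in the card): one device, no gradient
# accumulation, and no warmup steps.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step: int) -> float:
    """Learning rate after `step` optimizer steps, decaying linearly to zero."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return BASE_LR * remaining


print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"learning rate halfway through training: {linear_lr(total_steps // 2):.2e}")
```

Under these assumptions training runs for roughly 6,732 optimizer steps, which is consistent with the last logged step (6500) in the training results.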
---
## **Training Results**
| Training Loss | Epoch | Step | Validation Loss |
|--------------|--------|------|----------------|
| 0.0505 | 0.22 | 500 | 0.0214 |
| 0.0114 | 0.44 | 1000 | 0.0110 |
| 0.0088 | 0.66 | 1500 | 0.0032 |
| 0.0 | 0.89 | 2000 | 0.0048 |
| 0.0068 | 1.11 | 2500 | 0.0035 |
| 0.0 | 1.33 | 3000 | 0.0040 |
| 0.0 | 1.55 | 3500 | 0.0097 |
| 0.0053 | 1.78 | 4000 | 0.0101 |
| 0.0 | 2.00 | 4500 | 0.0053 |
| 0.0 | 2.22 | 5000 | 0.0039 |
| 0.0017 | 2.45 | 5500 | 0.0046 |
| 0.0 | 2.67 | 6000 | 0.0043 |
| 0.0 | 2.89 | 6500 | 0.0036 |
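
The validation loss in this table does not decrease monotonically, so the final logged value and the lowest value differ. The short sketch below, using only the numbers copied from the table, compares the two. The card does not say whether the published weights come from the final or the lowest-loss checkpoint, so treat this purely as a reading aid.

```python
# (step, validation_loss) pairs copied from the training results table.
eval_log = [
    (500, 0.0214), (1000, 0.0110), (1500, 0.0032), (2000, 0.0048),
    (2500, 0.0035), (3000, 0.0040), (3500, 0.0097), (4000, 0.0101),
    (4500, 0.0053), (5000, 0.0039), (5500, 0.0046), (6000, 0.0043),
    (6500, 0.0036),
]

# Lowest validation loss across all logged evaluations.
best_step, best_loss = min(eval_log, key=lambda row: row[1])
# Last logged evaluation, which matches the headline figure in this card.
final_step, final_loss = eval_log[-1]

print(f"lowest validation loss: {best_loss} at step {best_step}")
print(f"final logged validation loss: {final_loss} at step {final_step}")
```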
---
## **Framework Versions**
| Library | Version |
|--------------|------------|
| **Transformers** | `4.48.3` |
| **PyTorch** | `2.5.1+cu124` |
| **Datasets** | `3.3.2` |
| **Tokenizers** | `0.21.0` |
---
## **Model Usage**
To load and use the model for text classification:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
model_name = "answerdotai/ModernBERT-base-ai-detector"
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Create text classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
# Run classification
text = "This text was written by an AI model like ChatGPT."
result = classifier(text)
print(result)
```
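
The pipeline returns a list of dictionaries with `label` and `score` keys. The sketch below shows one way to turn such a prediction into a readable verdict using the label convention from this card (`1` for AI-generated, `0` for human-written). It assumes the checkpoint exposes the default label names `LABEL_0` and `LABEL_1`, which the card does not confirm; inspect `model.config.id2label` to check. The `describe` helper and the example prediction are hypothetical.

```python
# Assumption: the checkpoint keeps the default label names "LABEL_0"/"LABEL_1".
# Check model.config.id2label before relying on this mapping.
LABEL_NAMES = {"LABEL_0": "human-written", "LABEL_1": "AI-generated"}


def describe(prediction: dict) -> str:
    """Turn one pipeline prediction into a readable string."""
    # Unknown label names are passed through unchanged.
    name = LABEL_NAMES.get(prediction["label"], prediction["label"])
    return f"{name} (score={prediction['score']:.3f})"


# Hypothetical output, shaped like the list of dicts that classifier(text) returns.
example_result = [{"label": "LABEL_1", "score": 0.9987}]
print(describe(example_result[0]))
```

Because the score is a model confidence rather than a calibrated probability, borderline values are better flagged for human review than treated as a definitive verdict.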