---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- text-classification
- ai-content-detection
- bert
- transformers
- generated_from_trainer
model-index:
- name: answerdotai-ModernBERT-base-ai-detector
  results: []
---

# answerdotai-ModernBERT-base-ai-detector

This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on the AI vs Human Text Classification dataset, [DAIGT V2 Train Dataset](https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset/data).

It achieves the following results on the evaluation set:
- **Validation Loss:** `0.0036` (final logged evaluation, step 6500)

---

## **📝 Model Description**
This model is based on **ModernBERT-base**, a modernized, efficient encoder-only Transformer in the BERT family.  
It has been fine-tuned for **AI-generated vs Human-written text classification**, allowing it to distinguish between texts written by **AI models (ChatGPT, DeepSeek, Claude, etc.)** and human authors.  

---

## **🎯 Intended Uses & Limitations**
### ✅ **Intended Uses**
- **AI-generated content detection** (e.g., ChatGPT, Claude, DeepSeek).
- **Text classification** for distinguishing human vs AI-generated content.
- **Educational & Research applications** for AI-content detection.

### ⚠️ **Limitations**
- **Not 100% accurate** → Some AI texts may resemble human writing and vice versa.
- **Limited to the scope of the training data** → May struggle with **out-of-domain** text.
- **Bias risks** → If the dataset contains bias, the model may inherit it.

---

## **📊 Training and Evaluation Data**
- The model was fine-tuned on **35,894 training samples** and evaluated on **8,974 test samples**.
- The dataset consists of **AI-generated text samples (ChatGPT, Claude, DeepSeek, etc.)** and **human-written samples (Wikipedia, books, articles)**.
- Labels:
  - `1` → AI-generated text
  - `0` → Human-written text
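
When a checkpoint's config does not define custom label names, Transformers pipelines report classes as `LABEL_0`, `LABEL_1`, and so on. Below is a minimal sketch for turning those into the class names listed above, assuming this checkpoint uses the default naming. The `ID2LABEL` dict and `decode_label` helper are illustrative and not part of the model.

```python
# Label convention from this model card: 0 = human-written, 1 = AI-generated.
ID2LABEL = {0: "Human-written", 1: "AI-generated"}


def decode_label(raw_label: str) -> str:
    """Map a pipeline label such as "LABEL_1" to a readable class name.

    Assumes the default `LABEL_<id>` naming; labels that do not match that
    pattern or this two-class mapping are returned unchanged.
    """
    prefix = "LABEL_"
    if raw_label.startswith(prefix) and raw_label[len(prefix):].isdigit():
        return ID2LABEL.get(int(raw_label[len(prefix):]), raw_label)
    return raw_label


print(decode_label("LABEL_1"))
print(decode_label("LABEL_0"))
```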

---

## **⚙️ Training Procedure**
### **Training Hyperparameters**
The following hyperparameters were used during training:

| Hyperparameter        | Value                |
|----------------------|--------------------|
| **Learning Rate**    | `2e-5`             |
| **Train Batch Size** | `16`               |
| **Eval Batch Size**  | `16`               |
| **Optimizer**        | `AdamW` (`β1=0.9, β2=0.999, ε=1e-08`) |
| **LR Scheduler**     | `Linear`           |
| **Epochs**          | `3`                |
| **Mixed Precision**  | `Native AMP (fp16)` |
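
The table above corresponds roughly to the following `TrainingArguments` setup. This is a sketch, not the author's training script: `output_dir` is a placeholder, the 500-step evaluation interval is inferred from the training results table, and the step count assumes a single device with no gradient accumulation.

```python
import math

# Values taken from the hyperparameter table in this card.
training_kwargs = {
    "output_dir": "modernbert-ai-detector",  # placeholder path (assumption)
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 3,
    "lr_scheduler_type": "linear",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "fp16": True,  # native AMP; needs a CUDA device
    "eval_strategy": "steps",  # assumption: results are logged every 500 steps
    "eval_steps": 500,
    "logging_steps": 500,
}


def build_training_args():
    """Create TrainingArguments lazily so the import is only needed for training."""
    from transformers import TrainingArguments

    return TrainingArguments(**training_kwargs)


# Rough schedule length, assuming one device and no gradient accumulation.
train_samples = 35_894
steps_per_epoch = math.ceil(train_samples / training_kwargs["per_device_train_batch_size"])
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]
print(steps_per_epoch, total_steps)  # 2244 6732
```

Under these assumptions one epoch is about 2,244 optimizer steps, which is in line with the epoch column of the results table (step 500 is roughly 0.22 epochs).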

---

## **📈 Training Results**
| Training Loss | Epoch  | Step | Validation Loss |
|--------------|--------|------|----------------|
| 0.0505       | 0.22   | 500  | 0.0214         |
| 0.0114       | 0.44   | 1000 | 0.0110         |
| 0.0088       | 0.66   | 1500 | 0.0032         |
| 0.0          | 0.89   | 2000 | 0.0048         |
| 0.0068       | 1.11   | 2500 | 0.0035         |
| 0.0          | 1.33   | 3000 | 0.0040         |
| 0.0          | 1.55   | 3500 | 0.0097         |
| 0.0053       | 1.78   | 4000 | 0.0101         |
| 0.0          | 2.00   | 4500 | 0.0053         |
| 0.0          | 2.22   | 5000 | 0.0039         |
| 0.0017       | 2.45   | 5500 | 0.0046         |
| 0.0          | 2.67   | 6000 | 0.0043         |
| 0.0          | 2.89   | 6500 | 0.0036         |
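
Validation loss is not monotonic over the run, so it can be useful to check which logged step scored best. The sketch below re-enters the validation column of the table by hand and picks the minimum. It only inspects the logged numbers and says nothing about which checkpoint was actually published.

```python
# (step, validation_loss) pairs copied from the table above.
eval_log = [
    (500, 0.0214), (1000, 0.0110), (1500, 0.0032), (2000, 0.0048),
    (2500, 0.0035), (3000, 0.0040), (3500, 0.0097), (4000, 0.0101),
    (4500, 0.0053), (5000, 0.0039), (5500, 0.0046), (6000, 0.0043),
    (6500, 0.0036),
]

# Pick the row with the smallest validation loss.
best_step, best_loss = min(eval_log, key=lambda row: row[1])
final_step, final_loss = eval_log[-1]

print(f"lowest validation loss: {best_loss} at step {best_step}")
print(f"final logged validation loss: {final_loss} at step {final_step}")
```

In the logged numbers, the minimum (`0.0032` at step 1500) is slightly below the final value (`0.0036` at step 6500).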

---

## **🛠 Framework Versions**
| Library       | Version     |
|--------------|------------|
| **Transformers** | `4.48.3`  |
| **PyTorch**      | `2.5.1+cu124` |
| **Datasets**     | `3.3.2`  |
| **Tokenizers**   | `0.21.0` |

---

## **📤 Model Usage**
To load and use the model for text classification:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Hub repository ID as given in this card; adjust it if the model is hosted under a different namespace
model_name = "answerdotai/ModernBERT-base-ai-detector"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create text classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Run classification
text = "This text was written by an AI model like ChatGPT."
result = classifier(text)

# The pipeline returns a list with one dict per input, each holding a "label" and a "score"
print(result)
```
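
To score several texts at once and reduce each result to a single probability of being AI-generated, a small wrapper can help. This is a sketch under assumptions: `ai_label` must match whatever label name the checkpoint uses for class `1` (`LABEL_1` is the Transformers default), the pipeline is assumed to apply the default softmax over two classes, and the `0.5` threshold is an arbitrary starting point rather than a calibrated value. `score_texts` is a hypothetical helper, not part of the library.

```python
from typing import Callable, Dict, List


def score_texts(
    classifier: Callable[..., List[Dict]],
    texts: List[str],
    ai_label: str = "LABEL_1",  # assumption: default label name for class 1
    threshold: float = 0.5,     # assumption: uncalibrated cut-off
) -> List[Dict]:
    """Return the estimated P(AI-generated) and a thresholded flag per text."""
    # truncation=True is forwarded to the tokenizer so over-long inputs are cut
    # to the model's maximum length instead of raising an error.
    results = classifier(texts, truncation=True)
    scored = []
    for text, res in zip(texts, results):
        # The pipeline reports only the top class; with two classes the
        # other class has probability 1 - score.
        p_ai = res["score"] if res["label"] == ai_label else 1.0 - res["score"]
        scored.append({"text": text, "p_ai": p_ai, "is_ai": p_ai >= threshold})
    return scored
```

With the `classifier` pipeline created above, `score_texts(classifier, ["first text", "second text"])` returns one dict per input. Given the limitations listed earlier, the flag is best treated as a signal for review rather than a verdict.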