---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- text-classification
- ai-content-detection
- bert
- transformers
- generated_from_trainer
model-index:
- name: answerdotai-ModernBERT-base-ai-detector
results: []
---
# answerdotai-ModernBERT-base-ai-detector
This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) for AI vs. human text classification, trained on the [DAIGT V2 Train Dataset](https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset/data).
It achieves the following results on the evaluation set:
- **Validation Loss:** `0.0036`
---
## **📝 Model Description**
This model is based on **ModernBERT-base**, a lightweight and efficient BERT-style encoder.
It has been fine-tuned for **AI-generated vs. human-written text classification**, allowing it to distinguish between text produced by **AI models (ChatGPT, DeepSeek, Claude, etc.)** and text written by human authors.
---
## **🎯 Intended Uses & Limitations**
### ✅ **Intended Uses**
- **AI-generated content detection** (e.g., ChatGPT, Claude, DeepSeek).
- **Text classification** for distinguishing human vs AI-generated content.
- **Educational & Research applications** for AI-content detection.
### ⚠️ **Limitations**
- **Not 100% accurate** → Some AI-generated text may read like human writing, and vice versa.
- **Limited to the training data's scope** → May struggle with **out-of-domain** text.
- **Bias risks** → If the training data contains bias, the model may inherit it.
---
## **📊 Training and Evaluation Data**
- The model was fine-tuned on **35,894 training samples** and **8,974 test samples**.
- The dataset consists of **AI-generated text samples (ChatGPT, Claude, DeepSeek, etc.)** and **human-written samples (Wikipedia, books, articles)**.
- Labels:
  - `1` → AI-generated text
  - `0` → Human-written text
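
The pipeline's output labels depend on the `id2label` mapping stored in the checkpoint's config; if that mapping is missing, predictions come back as generic `LABEL_0` / `LABEL_1`. A minimal sketch for attaching the dataset's convention at load time (using the repo id assumed in the **Model Usage** section below):

```python
from transformers import AutoModelForSequenceClassification

# Dataset label convention: 0 = human-written, 1 = AI-generated.
# Passing the mapping here overrides generic LABEL_0 / LABEL_1 names
# if the saved config does not already define them.
id2label = {0: "Human-written", 1: "AI-generated"}
label2id = {label: idx for idx, label in id2label.items()}

model = AutoModelForSequenceClassification.from_pretrained(
    "AICodexLab/answerdotai-ModernBERT-base-ai-detector",
    id2label=id2label,
    label2id=label2id,
)
```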
---
## **⚙️ Training Procedure**
### **Training Hyperparameters**
The following hyperparameters were used during training:
| Hyperparameter | Value |
|----------------------|--------------------|
| **Learning Rate** | `2e-5` |
| **Train Batch Size** | `16` |
| **Eval Batch Size** | `16` |
| **Optimizer**         | `AdamW` (`β1=0.9, β2=0.999, ε=1e-08`) |
| **LR Scheduler** | `Linear` |
| **Epochs** | `3` |
| **Mixed Precision** | `Native AMP (fp16)` |
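
The exact training script is not included in this card, but the table above maps directly onto `TrainingArguments`. A minimal sketch with the settings listed (the output directory, evaluation/logging cadence, and warmup are assumptions, not taken from the original run):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="answerdotai-ModernBERT-base-ai-detector",  # assumed
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    fp16=True,              # native AMP mixed precision
    eval_strategy="steps",  # assumed; the results below log eval every 500 steps
    eval_steps=500,
    logging_steps=500,
)
```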
---
## **📈 Training Results**
| Training Loss | Epoch | Step | Validation Loss |
|--------------|--------|------|----------------|
| 0.0505 | 0.22 | 500 | 0.0214 |
| 0.0114 | 0.44 | 1000 | 0.0110 |
| 0.0088 | 0.66 | 1500 | 0.0032 |
| 0.0 | 0.89 | 2000 | 0.0048 |
| 0.0068 | 1.11 | 2500 | 0.0035 |
| 0.0 | 1.33 | 3000 | 0.0040 |
| 0.0 | 1.55 | 3500 | 0.0097 |
| 0.0053 | 1.78 | 4000 | 0.0101 |
| 0.0 | 2.00 | 4500 | 0.0053 |
| 0.0 | 2.22 | 5000 | 0.0039 |
| 0.0017 | 2.45 | 5500 | 0.0046 |
| 0.0 | 2.67 | 6000 | 0.0043 |
| 0.0 | 2.89 | 6500 | 0.0036 |
---
## **🛠 Framework Versions**
| Library | Version |
|--------------|------------|
| **Transformers** | `4.48.3` |
| **PyTorch** | `2.5.1+cu124` |
| **Datasets** | `3.3.2` |
| **Tokenizers** | `0.21.0` |
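
To check that an environment matches these versions (the pins above are simply what the original run used; newer releases should generally also load the model):

```python
import datasets
import tokenizers
import torch
import transformers

# Versions used for the original fine-tuning run
print(transformers.__version__)  # 4.48.3
print(torch.__version__)         # 2.5.1+cu124
print(datasets.__version__)      # 3.3.2
print(tokenizers.__version__)    # 0.21.0
```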
---
## **📤 Model Usage**
To load and use the model for text classification:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = "AICodexLab/answerdotai-ModernBERT-base-ai-detector"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create text classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Run classification
text = "This text was written by an AI model like ChatGPT."
result = classifier(text)
print(result)  # list with the predicted label and its confidence score
```