---
library_name: transformers
tags:
- sentiment-analysis
- imdb
- text-classification
- distilbert
license: apache-2.0
datasets:
- stanfordnlp/imdb
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
---
# Model Card for DistilBERT Fine-Tuned on IMDB Sentiment Analysis
## Model Details
### Model Description
This model is a fine-tuned version of `distilbert-base-uncased` on the **IMDB movie reviews dataset** for **binary sentiment classification**, labeling each review as **positive (1)** or **negative (0)**.
- **Developed by:** Nikke Salonen
- **Finetuned from model:** `distilbert-base-uncased`
- **Language(s):** English
- **License:** Apache 2.0
### Model Sources
- **Repository:** https://huggingface.co/NikkeS/imdb-distilbert/
- **Dataset:** [IMDB Dataset](https://ai.stanford.edu/~amaas/data/sentiment/)
## Uses
### Direct Use
- Sentiment analysis of **English text reviews**.
- Can be used for **opinion mining** on movie reviews and similar datasets.
### Downstream Use
- Can be **fine-tuned further** for sentiment classification in other domains (e.g., product reviews, social media sentiment analysis).
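
For domain adaptation, this checkpoint can be warm-started in place of the base model before training on domain-specific labeled data. A minimal sketch (the subsequent training loop is up to you):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Warm-start from this checkpoint instead of distilbert-base-uncased,
# then fine-tune on your own labeled data with a standard training loop
model = AutoModelForSequenceClassification.from_pretrained("NikkeS/imdb-distilbert")
tokenizer = AutoTokenizer.from_pretrained("NikkeS/imdb-distilbert")
```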
### Out-of-Scope Use
- Not suitable for **languages other than English**.
- Not recommended for **high-stakes decision-making** without human oversight.
## Bias, Risks, and Limitations
- The model is **trained on IMDB reviews**, so it may **not generalize well** to other types of sentiment analysis tasks.
- May exhibit **biases present in the training data**.
- Sentiment classification **depends heavily on context**, and the model may misinterpret sarcasm or complex sentences.
### Recommendations
- Users should **evaluate the model** on their specific datasets before deploying in production.
- If biases are detected, consider **fine-tuning on a more diverse dataset**.
## How to Use the Model
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained("NikkeS/imdb-distilbert")
tokenizer = AutoTokenizer.from_pretrained("NikkeS/imdb-distilbert")
model.eval()  # disable dropout for inference

def predict_sentiment(review):
    inputs = tokenizer(review, return_tensors="pt", truncation=True,
                       padding=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
    prediction = torch.argmax(logits, dim=1).item()
    return "Positive" if prediction == 1 else "Negative"

# Example usage
print(predict_sentiment("This movie was absolutely fantastic!"))                   # Positive
print(predict_sentiment("The acting was terrible, and the story made no sense."))  # Negative
```
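
Alternatively, the high-level `pipeline` API handles tokenization, inference, and label mapping in one call. Note that the returned label names (`LABEL_0`/`LABEL_1` by default) depend on the `id2label` mapping saved with the model, so treat this as a sketch rather than guaranteed output:

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="NikkeS/imdb-distilbert")

# Returns a list of {label, score} dicts; LABEL_1 corresponds to positive here
print(classifier("This movie was absolutely fantastic!"))
```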
## Training Details
### Training Data
- The model was fine-tuned on the IMDB dataset (50,000 labeled movie reviews).
- The dataset is balanced (25,000 positive and 25,000 negative reviews).
- The training split consisted of 40,000 samples, while 5,000 samples were used for validation.
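
The exact split procedure is not recorded here; the following sketch shows one plausible way to recreate a 40,000/5,000/5,000 split from the 50,000 labeled reviews with the `datasets` library (the shuffle seed is an assumption):

```python
from datasets import load_dataset, concatenate_datasets

# The Hub hosts the 50,000 labeled reviews as 25k train + 25k test
imdb = load_dataset("stanfordnlp/imdb")
full = concatenate_datasets([imdb["train"], imdb["test"]]).shuffle(seed=42)

# Recreate an 80/10/10 split: 40,000 train, 5,000 validation, 5,000 test
train_ds = full.select(range(40_000))
val_ds   = full.select(range(40_000, 45_000))
test_ds  = full.select(range(45_000, 50_000))
```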
### Training Procedure
#### Preprocessing
- Tokenized using `distilbert-base-uncased` tokenizer.
- Applied **dynamic padding, truncation, and a max sequence length of 256**.
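
A minimal sketch of this preprocessing, reusing the `train_ds`/`val_ds` splits from the sketch above; `DataCollatorWithPadding` implements the dynamic padding by padding each batch only to its longest sequence:

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate to 256 tokens; padding is deferred to the collator
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized_train = train_ds.map(tokenize, batched=True)
tokenized_val = val_ds.map(tokenize, batched=True)

# Pads per batch at collation time instead of to a fixed global length
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```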
#### Training Hyperparameters
- **Learning rate:** `5e-5`
- **Batch size:** `16`
- **Epochs:** `2`
- **Optimizer:** AdamW
- **Loss Function:** Cross-Entropy Loss
#### Compute Infrastructure
- **Hardware:** Google Colab T4 GPU
- **Precision:** Mixed precision (`fp16=True` for efficiency)
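
Putting the hyperparameters together, a training setup along these lines reproduces the configuration above, assuming the `tokenized_train`/`tokenized_val`/`data_collator` objects from the preprocessing sketch; AdamW and cross-entropy loss are the `Trainer` defaults for sequence classification, so they need no explicit wiring. The output directory and evaluation schedule are assumptions:

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

training_args = TrainingArguments(
    output_dir="imdb-distilbert",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    fp16=True,                # mixed precision on the T4 GPU
    eval_strategy="epoch",    # "evaluation_strategy" on older transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
)
trainer.train()
```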
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
- The model was evaluated on a 5,000-sample test set from the IMDB dataset.
#### Metrics
- **Accuracy:** 90.4%
- **Precision:** 92.1%
- **Recall:** 88.2%
- **F1-score:** 90.0%
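
Metrics like these can be computed with a `compute_metrics` function passed to the `Trainer`. A sketch using scikit-learn (how the reported numbers were actually produced is not documented, so treat this as illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}

# Pass compute_metrics=compute_metrics to the Trainer, then:
# trainer.evaluate(eval_dataset=tokenized_test)
```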
## Model Examination
- The model performs well on **general sentiment classification** but may struggle with **sarcasm, irony, or very short reviews**.
## Environmental Impact
- **Hardware Type:** Google Colab T4 GPU
- **Training Time:** ~1 hour
- **CO2 Emission Estimate:** not measured; it can be estimated with the [ML Impact Calculator](https://mlco2.github.io/impact#compute)
## Citation
If you use this model, please cite:
```bibtex
@misc{salonen2025imdb-distilbert,
  title        = {Fine-tuned DistilBERT for Sentiment Analysis on IMDB Reviews},
  author       = {Salonen, Nikke},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/NikkeS/imdb-distilbert}}
}
```
## More Information
- **Hugging Face Model Page:** https://huggingface.co/NikkeS/imdb-distilbert/
- **Dataset:** [IMDB Dataset](https://ai.stanford.edu/~amaas/data/sentiment/)
## Model Card Authors
- Nikke Salonen
## Contact
For questions or issues, contact **[email protected]**.