|
--- |
|
library_name: transformers |
|
tags: |
|
- sentiment-analysis |
|
- imdb |
|
- text-classification |
|
- distilbert |
|
license: apache-2.0 |
|
datasets: |
|
- stanfordnlp/imdb |
|
language: |
|
- en |
|
metrics: |
|
- accuracy |
|
- precision |
|
- recall |
|
- f1 |
|
base_model: |
|
- distilbert/distilbert-base-uncased |
|
pipeline_tag: text-classification |
|
--- |
|
|
|
# Model Card for DistilBERT Fine-Tuned on IMDB Sentiment Analysis |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is a fine-tuned version of `distilbert-base-uncased`, trained on the **IMDB movie reviews dataset** for **binary sentiment classification**: it labels an English movie review as **positive (1)** or **negative (0)**.
|
|
|
- **Developed by:** Nikke Salonen |
|
- **Finetuned from model:** `distilbert-base-uncased` |
|
- **Language(s):** English |
|
- **License:** Apache 2.0 |
|
|
|
### Model Sources |
|
- **Repository:** https://huggingface.co/NikkeS/imdb-distilbert/ |
|
- **Dataset:** [IMDB Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
- Sentiment analysis of **English text reviews**. |
|
- Can be used for **opinion mining** on movie reviews and similar datasets. |
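
For quick checks, the model can also be loaded through the high-level `pipeline` API. A minimal sketch follows; note that the exact label strings returned (e.g. `LABEL_0`/`LABEL_1` versus human-readable names) depend on the `id2label` mapping stored in the model config.

```python
from transformers import pipeline

# Load this model through the text-classification pipeline
classifier = pipeline("text-classification", model="NikkeS/imdb-distilbert")

# Returns a list of dicts with a label and a confidence score;
# the label string depends on the model's id2label mapping.
print(classifier("A beautifully shot film with a gripping story."))
```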
|
|
|
### Downstream Use |
|
- Can be **fine-tuned further** for sentiment classification in other domains (e.g., product reviews, social media sentiment analysis). |
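
As a starting point for such domain adaptation, here is a minimal fine-tuning sketch using the `Trainer` API. The CSV file name and its `text`/`label` columns are hypothetical placeholders, and the training arguments are illustrative rather than a record of any actual run.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# Hypothetical domain dataset with "text" and "label" (0/1) columns
dataset = load_dataset("csv", data_files={"train": "product_reviews.csv"})

tokenizer = AutoTokenizer.from_pretrained("NikkeS/imdb-distilbert")
model = AutoModelForSequenceClassification.from_pretrained("NikkeS/imdb-distilbert")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-sentiment", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()
```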
|
|
|
### Out-of-Scope Use |
|
- Not suitable for **languages other than English**. |
|
- Not recommended for **high-stakes decision-making** without human oversight. |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
- The model is **trained on IMDB reviews**, so it may **not generalize well** to other types of sentiment analysis tasks. |
|
- May exhibit **biases present in the training data**. |
|
- Sentiment classification **depends heavily on context**, and the model may misinterpret sarcasm or complex sentences. |
|
|
|
### Recommendations |
|
- Users should **evaluate the model** on their specific datasets before deploying in production. |
|
- If biases are detected, consider **fine-tuning on a more diverse dataset**. |
|
|
|
## How to Use the Model |
|
|
|
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained("NikkeS/imdb-distilbert")
tokenizer = AutoTokenizer.from_pretrained("NikkeS/imdb-distilbert")

def predict_sentiment(review):
    # Tokenize with the same settings used during training
    inputs = tokenizer(review, return_tensors="pt", truncation=True, padding=True, max_length=256)
    # Run inference without tracking gradients
    with torch.no_grad():
        logits = model(**inputs).logits
    prediction = torch.argmax(logits, dim=1).item()
    return "Positive" if prediction == 1 else "Negative"

# Example usage
print(predict_sentiment("This movie was absolutely fantastic!"))
print(predict_sentiment("The acting was terrible, and the story made no sense."))
```
|
|
|
## Training Details |
|
|
|
### Training Data |
|
- The model was fine-tuned on the IMDB dataset (50,000 labeled movie reviews). |
|
- The dataset is balanced (25,000 positive and 25,000 negative reviews). |
|
- The data was split into 40,000 training samples, 5,000 validation samples, and 5,000 held-out test samples (see Evaluation).
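
The split sizes above can be reproduced with the `datasets` library roughly as follows. This is an illustrative sketch: the actual shuffling seed used for this model is not recorded.

```python
from datasets import load_dataset, concatenate_datasets

# IMDB ships as 25k train / 25k test; pool and re-split into
# 40k train / 5k validation / 5k test (seed is illustrative)
imdb = load_dataset("stanfordnlp/imdb")
pooled = concatenate_datasets([imdb["train"], imdb["test"]]).shuffle(seed=42)

train_ds = pooled.select(range(40_000))
val_ds = pooled.select(range(40_000, 45_000))
test_ds = pooled.select(range(45_000, 50_000))
```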
|
|
|
### Training Procedure |
|
#### Preprocessing |
|
- Tokenized using `distilbert-base-uncased` tokenizer. |
|
- Applied **dynamic padding, truncation, and a max sequence length of 256**. |
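
In the standard `transformers` workflow this corresponds to truncating at map time and padding dynamically at batch time with `DataCollatorWithPadding`; a sketch of that setup, assuming a `text` column:

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

def tokenize(batch):
    # Truncate each review to at most 256 tokens; defer padding
    return tokenizer(batch["text"], truncation=True, max_length=256)

# Dynamic padding: each batch is padded only to its longest sequence,
# rather than to a fixed global length
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```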
|
|
|
#### Training Hyperparameters |
|
- **Learning rate:** `5e-5` |
|
- **Batch size:** `16` |
|
- **Epochs:** `2` |
|
- **Optimizer:** AdamW |
|
- **Loss Function:** Cross-Entropy Loss |
|
|
|
#### Compute Infrastructure |
|
- **Hardware:** Google Colab T4 GPU |
|
- **Precision:** Mixed precision (`fp16=True` for efficiency) |
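
The hyperparameters and precision settings above map onto `TrainingArguments` roughly as follows. This is a sketch: AdamW and cross-entropy loss are the `Trainer` defaults for sequence classification, and the exact argument set used in training is not recorded.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="imdb-distilbert",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=2,
    fp16=True,  # mixed precision for efficiency on the T4 GPU
)
# AdamW (default optimizer) and cross-entropy (default loss for
# sequence classification) need no explicit configuration here.
```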
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
#### Testing Data |
|
- The model was evaluated on a 5,000-sample test set from the IMDB dataset. |
|
|
|
#### Metrics |
|
- **Accuracy:** 90.4%

- **Precision:** 92.1%

- **Recall:** 88.2%

- **F1-score:** 90.0%
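
Metrics of this kind can be computed with the `evaluate` library. A self-contained sketch with dummy predictions (the real numbers above come from the 5,000-sample test set):

```python
import evaluate

# Binary predictions and gold labels; dummy values for illustration
predictions = [1, 0, 1, 1, 0]
references = [1, 0, 0, 1, 0]

results = {}
for name in ("accuracy", "precision", "recall", "f1"):
    metric = evaluate.load(name)
    results.update(metric.compute(predictions=predictions, references=references))
print(results)  # e.g. {'accuracy': 0.8, 'precision': ..., ...}
```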
|
|
|
## Model Examination |
|
- The model performs well on **general sentiment classification** but may struggle with **sarcasm, irony, or very short reviews**. |
|
|
|
## Environmental Impact |
|
- **Hardware Type:** Google Colab T4 GPU |
|
- **Training Time:** ~1 hour |
|
- **CO2 Emission Estimate:** not measured; it can be estimated with the [ML Impact Calculator](https://mlco2.github.io/impact#compute).
|
|
|
## Citation |
|
If you use this model, please cite: |
|
```bibtex |
|
@misc{salonen2025imdb-distilbert,
  title={Fine-tuned DistilBERT for Sentiment Analysis on IMDB Reviews},
  author={Salonen, Nikke},
  year={2025}
}
|
``` |
|
|
|
## More Information |
|
- **Hugging Face Model Page:** https://huggingface.co/NikkeS/imdb-distilbert/
|
- **Dataset:** [IMDB Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) |
|
|
|
## Model Card Authors |
|
- Nikke Salonen
|
|
|
## Contact |
|
For questions or issues, contact **[email protected]**. |
|
|
|