---
datasets:
- flaviagiammarino/vqa-rad
base_model:
- vikhyatk/moondream2
tags:
- med
- vqa
- vqarad
- finetune
- vision
- VLM
---
# MoonDream2 Fine-Tuning on the Med VQA RAD Dataset

## Description

This project fine-tunes the **MoonDream2** model on the **Med VQA RAD** dataset to improve its medical visual question answering (VQA) capabilities. Hyperparameters are tuned with **Optuna**, and training runs are tracked with **Weights & Biases (W&B)**.

## Training Environment

- **Hardware**: NVIDIA GPU (CUDA-enabled)
- **Frameworks**: PyTorch, Hugging Face Transformers (see the setup sketch below)
- **Optimizer**: Adam8bit (from bitsandbytes)
- **Batch Processing**: PyTorch `DataLoader`
- **Hyperparameter Tuning**: Optuna
- **Logging**: Weights & Biases (W&B)
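
A minimal setup sketch, assuming the standard `transformers` loading path for this checkpoint (the dtype choice is an assumption, not pinned by this project):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DEVICE = "cuda"  # training assumes a CUDA-enabled GPU

# moondream2 ships custom modeling code, so trust_remote_code is required;
# pinning a specific revision is recommended for reproducibility
tokenizer = AutoTokenizer.from_pretrained("vikhyatk/moondream2")
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,
    torch_dtype=torch.float16,  # assumption: half precision on the GPU
).to(DEVICE)
model.train()
```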

## Dataset

- **Name**: Med VQA RAD, published on the Hugging Face Hub as `flaviagiammarino/vqa-rad` (loading sketch below)
- **Content**: Medical visual question-answering dataset of radiology images paired with Q&A pairs.
- **Preprocessing**: Images are encoded with **MoonDream2's vision encoder**.
- **Tokenization**: Text is tokenized with **Hugging Face's tokenizer**.
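
The dataset can be pulled straight from the Hub; the split and column names below follow the dataset card and are worth verifying:

```python
from datasets import load_dataset

# VQA-RAD as published on the Hugging Face Hub
dataset = load_dataset("flaviagiammarino/vqa-rad")

sample = dataset["train"][0]
print(sample["question"])   # free-text radiology question
print(sample["answer"])     # short free-text answer
image = sample["image"]     # PIL image handed to the vision encoder
```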

## Training Parameters

- **Model**: vikhyatk/moondream2
- **Number of Image Tokens**: 729
- **Learning Rate (LR)**: Tuned via Optuna (log-uniform search between **1e-6** and **1e-4**)
- **Batch Size**: 3
- **Gradient Accumulation Steps**: 8 / batch size
- **Optimizer**: Adam8bit (betas=(0.9, 0.95), eps=1e-6)
- **Loss Function**: Cross-entropy loss computed on token-level outputs
- **Scheduler**: Cosine annealing with warm-up over 10% of total steps (see the sketch below)
- **Epochs**: Tuned via Optuna (search range: 1-2)
- **Validation Strategy**: Loss-based evaluation on a held-out validation set
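
Putting the optimizer and schedule together might look like the sketch below, reusing the `model` from the setup sketch above. The decay floor of 0.1 × LR mirrors the warm-up start described under Training Process and is an assumption, as are the placeholder step counts:

```python
import math
import bitsandbytes as bnb

LR = 3e-5                      # example value from the Optuna search range
BATCH_SIZE = 3
GRAD_ACCUM = 8 // BATCH_SIZE   # "8 / batch size", integer division assumed
TOTAL_STEPS = 1_000            # placeholder: (len(loader) * epochs) // GRAD_ACCUM
WARMUP = int(0.1 * TOTAL_STEPS)

optimizer = bnb.optim.Adam8bit(model.parameters(), lr=LR, betas=(0.9, 0.95), eps=1e-6)

def lr_at(step: int) -> float:
    """Linear warm-up from 0.1 * LR, then cosine decay back toward 0.1 * LR."""
    if step < WARMUP:
        return 0.1 * LR + 0.9 * LR * (step / WARMUP)
    progress = (step - WARMUP) / max(1, TOTAL_STEPS - WARMUP)
    return 0.1 * LR + 0.9 * LR * 0.5 * (1 + math.cos(math.pi * progress))

# inside the training loop, before optimizer.step():
#   for group in optimizer.param_groups:
#       group["lr"] = lr_at(global_step)
```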

## Training Process

1. **Collate Function** (sketched after this list):
   - Prepares image embeddings using **MoonDream2's vision encoder**.
   - Converts question-answer pairs into tokenized sequences.
   - Pads sequences to a uniform input length.
2. **Loss Computation** (also in the sketch below):
   - Generates text embeddings.
   - Concatenates image and text embeddings.
   - Computes cross-entropy loss with **MoonDream2's causal language model**.
3. **Learning Rate Scheduling** (see the schedule sketch under Training Parameters):
   - Starts at **0.1 × LR** and increases linearly during warm-up.
   - Follows cosine decay after warm-up.
4. **Hyperparameter Optimization** (see the Optuna sketch below):
   - Optuna searches over the learning rate and epoch count.
   - Underperforming trials are pruned early.
5. **Logging & Monitoring**:
   - W&B logs loss, learning rate, and training progress.
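
A rough sketch of steps 1-2, reusing `model`, `tokenizer`, and `DEVICE` from the setup sketch. The module names (`vision_encoder`, `text_model`) and the prompt format are illustrative assumptions based on the description above, not a confirmed MoonDream2 API:

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    """Encode images, tokenize Q&A text, and pad to a uniform length."""
    with torch.no_grad():  # the vision encoder is kept frozen in this sketch
        image_embeds = model.vision_encoder([s["image"] for s in batch])  # (B, 729, D)

    token_ids = [
        torch.tensor(
            tokenizer(
                f"\n\nQuestion: {s['question']}\n\nAnswer: {s['answer']}{tokenizer.eos_token}"
            ).input_ids
        )
        for s in batch
    ]
    tokens = pad_sequence(token_ids, batch_first=True, padding_value=tokenizer.eos_token_id)
    return image_embeds, tokens

def compute_loss(image_embeds, tokens):
    """Concatenate image and text embeddings and score with the causal LM."""
    tokens = tokens.to(DEVICE)
    text_embeds = model.text_model.get_input_embeddings()(tokens)
    inputs = torch.cat([image_embeds.to(DEVICE), text_embeds], dim=1)

    logits = model.text_model(inputs_embeds=inputs).logits
    n_img = image_embeds.size(1)
    # positions n_img .. end-1 predict the next text token
    shift_logits = logits[:, n_img:-1, :]
    shift_targets = tokens[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)), shift_targets.reshape(-1)
    )
```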
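Steps 4-5 can be wired together roughly as follows. This is a sketch: the W&B project name and the `train_one_epoch` / `evaluate_validation` helpers are hypothetical stand-ins for the training loop described above.

```python
import optuna
import wandb

def objective(trial: optuna.Trial) -> float:
    # Search spaces mirror the Training Parameters section
    lr = trial.suggest_float("lr", 1e-6, 1e-4, log=True)
    epochs = trial.suggest_int("epochs", 1, 2)
    wandb.init(project="moondream2-vqa-rad", config={"lr": lr, "epochs": epochs}, reinit=True)

    val_loss = float("inf")
    for epoch in range(epochs):
        train_one_epoch(lr)               # hypothetical: one pass over the training DataLoader
        val_loss = evaluate_validation()  # hypothetical: mean loss on the validation set
        wandb.log({"epoch": epoch, "val_loss": val_loss, "lr": lr})

        # Prune trials whose intermediate validation loss is suboptimal
        trial.report(val_loss, step=epoch)
        if trial.should_prune():
            wandb.finish()
            raise optuna.TrialPruned()

    wandb.finish()
    return val_loss

study = optuna.create_study(direction="minimize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=10)
print("Best hyperparameters:", study.best_params)
```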
## Results

- **Best Hyperparameters**: Selected from the Optuna trials.
- **Final Validation Loss**: Computed on the validation set and logged to W&B.
- **Model Performance**: Evaluated using token-level accuracy and qualitative assessment of generated answers.

## References

- [MoonDream2 on Hugging Face](https://huggingface.co/vikhyatk/moondream2)
- [Med VQA RAD Dataset](https://github.com/med-vqa)
- [Optuna Documentation](https://optuna.org/)